
Inside Zhipu: How One of China’s AI Tigers Thinks About Building Models [Livestreams]

A candid conversation on tradeoffs, scale, and what actually matters.

It takes time to create work that’s clear, independent, and genuinely useful. If you’ve found value in this newsletter, consider becoming a paid subscriber. It helps me dive deeper into research, reach more people, stay free from ads/hidden agendas, and supports my crippling chocolate milk addiction. We run on a “pay what you can” model—so if you believe in the mission, there’s likely a plan that fits (over here).

Every subscription helps me stay independent, avoid clickbait, and focus on depth over noise, and I deeply appreciate everyone who chooses to support our cult.

Help me buy chocolate milk

PS – Supporting this work doesn’t have to come out of your pocket. If you read this as part of your professional development, you can use this email template to request reimbursement for your subscription.

Every month, the Chocolate Milk Cult reaches over a million Builders, Investors, Policy Makers, Leaders, and more. If you’d like to meet other members of our community, please fill out this contact form here (I will never sell your data nor will I make intros w/o your explicit permission)- https://forms.gle/Pi1pGLuS1FmzXoLr6


This conversation with Zhipu AI, one of China’s leading AI labs, was recorded just as their latest models were about to launch. At the time, I was pulled in several directions—visa issues, mobilizing an open-source release of our reasoning infrastructure, and other unavoidable interruptions—which slowed down publication. The delay has nothing to do with the substance here; the ideas and exchanges you’re about to read are still timely, real, and indicative of how serious AI builders in China are approaching model development and strategic tradeoffs.

What follows is a candid, technical, and wide-ranging conversation that doesn’t lean on marketing language or shallow soundbites. I’m grateful to Zhipu’s team—especially Zixuan—for engaging thoughtfully and generously. For readers who want to follow their work more directly, you can find Zixuan’s posts and perspectives on X (Twitter) over here and follow Z AI here.

PS: (there is a blank space from 44:35 to 47:30 as Zixuan had to change rooms; skip that part or just enjoy the silence).

Before we get deep into the conversation, here is your regular reminder that we have a new foster cat that’s ready to be adopted. I call him Chipku (Hindi for clingy; his government name is Jancy), and as you might guess, he’s very affectionate. I’ve trained him to be better around animals and strangers, and he’s perfect for families that already have some experience with cats. We sleep together every day, and waking up to him is one of the nicest feelings. If you’re around New York City, adopt him here (or share this listing with someone who might be interested).

[Photo: Jancy, an adoptable Domestic Short Hair in New York, NY]

Now enjoy the convo.

Companion Guide to the Livestream: Zixuan Li on Zhipu AI’s “Good Enough” Bet

This guide expands the core ideas and structures them for deeper reflection. Watch the full stream for tone, nuance, and side-commentary.


1. The Mercedes-Benz Strategy in a Rolls-Royce Race

The Event — Zixuan Li, Head of Global Services at Zhipu AI, laid out a positioning thesis that cuts against the dominant narrative in AI: Zhipu is not trying to build the best model. They’re trying to build the best model for what 90% of users actually do.

Why this matters — The frontier AI race has a specific shape right now. OpenAI, Anthropic, Google, and to some extent DeepSeek are competing for the absolute ceiling — IMO-level math, PhD-level science, the hardest possible reasoning tasks. Zhipu looked at that race and asked a different question: who are our actual users, and what are they actually doing?

The answer: high school students checking homework with vision models. Developers building side projects. People who need coding assistance, role-playing, translation. Not scientists pushing the boundary of mathematical reasoning.

Zixuan framed it explicitly: “Someone wants to make a Rolls-Royce, but we decided to make a Mercedes-Benz.” A Mercedes serves 90% of people. And critically, it serves them at a price point and accessibility level that changes the adoption math entirely.

The historical pattern — This isn’t a new playbook. The most transformative moments in technology happen when someone takes a powerful capability and makes a 60%-as-powerful version available at a tenth of the cost. Smartphones weren’t better than computers — they were cheaper, more portable, and “good enough.” That constraint unlocked an entire ecosystem of apps and interfaces nobody predicted. Zhipu is betting the same dynamic applies to AI models.


2. Why GLM-4.5 Is So Good at Agentic Tool Use (And Why That’s Not an Accident)

The Event — Zixuan walked through how Zhipu studied products like Manus and Claude Code to understand what makes agentic workflows actually work — then reverse-engineered those insights into their training process.

The mechanism — Before agentic products hit the market, most people only saw AI “thinking” — reasoning tokens, chain-of-thought. What Manus and Claude Code showed was something different: an AI that acts. It searches, evaluates, searches again, writes HTML, renders a presentation. Multiple tool calls orchestrated in sequence, with the model deciding when it has enough information to proceed.

Zhipu didn’t just observe this — they decomposed it. For every agentic task, they mapped the trajectory: what tools are needed, in what order, with what thinking process between each step. Then they built training data around those trajectories. Not generic tool-use examples. Scenario-specific trajectories that mirror how humans actually accomplish tasks.
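Zhipu's actual data format is not public, but the idea of a scenario-specific trajectory is easy to sketch. The sketch below (all names, tools, and content invented for illustration) shows the shape of one "search, evaluate, search again, render" trajectory: each step pairs the model's reasoning with the tool call it produced and the observation that came back.

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    """One step in an agentic trajectory: the reasoning, the tool invoked,
    and what the tool returned."""
    thought: str       # chain-of-thought before acting
    tool: str          # tool name, e.g. "web_search"
    arguments: dict    # structured tool arguments
    observation: str   # tool output fed back into the context

@dataclass
class Trajectory:
    """A scenario-specific tool-use trajectory mirroring how a human
    would decompose the task."""
    scenario: str
    goal: str
    steps: list = field(default_factory=list)
    final_answer: str = ""

# A toy trajectory; every detail here is hypothetical.
traj = Trajectory(
    scenario="research-and-present",
    goal="Summarize recent GLM releases as a one-slide HTML page",
)
traj.steps.append(Step(
    thought="I need recent sources before writing anything.",
    tool="web_search",
    arguments={"query": "GLM-4.5 release notes"},
    observation="3 results: blog post, benchmark thread, docs page",
))
traj.steps.append(Step(
    thought="The blog post looks authoritative; fetch and extract key points.",
    tool="fetch_page",
    arguments={"url": "https://example.com/glm-blog"},
    observation="GLM-4.5 focuses on agentic tool use and coding...",
))
traj.steps.append(Step(
    thought="I have enough information; render the slide.",
    tool="write_file",
    arguments={"path": "slide.html", "content": "<h1>GLM-4.5</h1>..."},
    observation="wrote 1 file",
))
traj.final_answer = "Slide written to slide.html"

print(len(traj.steps), traj.steps[0].tool)
```

The training signal is the whole record, not any single step: the model learns when to keep gathering information and when it has enough to act.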

This is why GLM-4.5 performs well on agentic benchmarks despite being a smaller lab’s model. They didn’t try to build general intelligence that happens to be good at tool use. They studied what tool use looks like in practice, then specifically trained for it.

The insight most people miss — Zixuan made a point that deserves emphasis: the failure mode in most agent implementations isn’t the model or the system prompt. It’s the tool definitions. Outputs that pollute the context window. Overlapping tool descriptions that confuse the model about when to use what. You can’t put 10,000 tools in a system prompt. You have to understand which tools matter for which scenarios, and that understanding comes from studying real user behavior — not from benchmarks.
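One minimal way to act on that insight is to scope the tool registry per scenario, so only relevant definitions ever reach the system prompt. This sketch is purely illustrative (the registry, scenario names, and descriptions are invented), but it shows the mechanic: select before you prompt.

```python
# Hypothetical tool registry; names and descriptions are illustrative.
TOOLS = {
    "web_search": {"scenarios": {"research"}, "description": "Search the web for a query."},
    "fetch_page": {"scenarios": {"research"}, "description": "Download and extract a web page."},
    "run_python": {"scenarios": {"coding", "data"}, "description": "Execute a Python snippet."},
    "write_file": {"scenarios": {"coding", "research"}, "description": "Write text to a file."},
    "send_email": {"scenarios": {"comms"}, "description": "Send an email."},
}

def tools_for(scenario: str) -> list[str]:
    """Expose only the tools relevant to the detected scenario, instead of
    dumping the entire registry into the context window."""
    return sorted(name for name, meta in TOOLS.items()
                  if scenario in meta["scenarios"])

def system_prompt(scenario: str) -> str:
    """Build a system prompt containing only the scenario's tools, keeping
    overlapping or irrelevant definitions out of the model's context."""
    lines = [f"- {name}: {TOOLS[name]['description']}" for name in tools_for(scenario)]
    return "You may use these tools:\n" + "\n".join(lines)

print(tools_for("research"))  # only research-relevant tools reach the context
```

Knowing which scenarios map to which tools is exactly the understanding Zixuan says must come from real user behavior rather than benchmarks.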


3. AutoGLM: The Phone Agent That Spawned an Ecosystem in 48 Hours

The Event — Zhipu built AutoGLM, which Zixuan described as the world’s first phone-use agent. They open-sourced the Android version. Within two days, the community had built an iOS port.

How it works — AutoGLM operates by understanding phone interaction trajectories. To order an Uber, you open the app, enter your location and destination, choose a service tier, and wait for a driver. Each of those steps involves navigating between screens, pressing buttons, reading and entering text. Zhipu labeled massive amounts of this data — how people transition from one screen to another, how they interact with buttons, how they navigate between apps.

The training data isn’t just screenshots. It’s the paths between screenshots — the trajectories that represent how humans actually use their phones. Shopping app → messaging app → back to shopping app. That sequential, cross-app understanding is what makes AutoGLM different from a model that can merely describe what’s on a screen.
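A trajectory-between-screenshots dataset might look something like the following. Again, this is a hypothetical sketch (the app, screen, and button names are invented), but it captures the key property: each labeled action records both the screen it happened on and the screen it led to, so the path itself is the data.

```python
from dataclasses import dataclass

@dataclass
class UIAction:
    """One labeled phone interaction: the screen, the action taken on it,
    and the screen it transitioned to."""
    app: str
    screen: str        # e.g. "home", "destination_entry"
    action: str        # "tap", "type", "swipe", "switch_app"
    target: str        # button or field the action touched
    next_screen: str

# A toy ride-hailing trajectory: the training signal is the *path*
# between screens, not any single screenshot.
ride_trajectory = [
    UIAction("rideshare", "home", "tap", "where_to_field", "destination_entry"),
    UIAction("rideshare", "destination_entry", "type", "destination_field", "service_selection"),
    UIAction("rideshare", "service_selection", "tap", "standard_tier", "confirm_pickup"),
    UIAction("rideshare", "confirm_pickup", "tap", "request_button", "waiting_for_driver"),
]

# Consistency check: each step must start where the previous one ended.
for prev, curr in zip(ride_trajectory, ride_trajectory[1:]):
    assert prev.next_screen == curr.screen

print(len(ride_trajectory), ride_trajectory[-1].next_screen)
```

Cross-app sequences (shopping app to messaging app and back) are just trajectories whose steps span multiple `app` values, which is why the same representation scales from one app to the whole phone.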

Why this is a bigger bet than it looks — Zixuan sees phone and hardware integration as the long-term play. Edge models on devices, controlling physical interactions. Today AutoGLM sometimes orders you a tea when you asked for boba. But Zixuan’s prediction: by 2026 or 2027, phone agents will perform tasks better than humans. And once you’ve solved phone interaction, robots and other hardware become tractable extensions of the same architecture.

The speed of the community response — iOS port in 48 hours — validates the open-source strategy. You build for one platform, the ecosystem builds the rest.


4. The Data Economics That Decide What’s Worth Training

The Event — When pressed on why Zhipu doesn’t chase IMO-level math performance, Zixuan gave a framework that’s more honest than what most labs will tell you: it comes down to data annotation economics.

The constraint — IMO produces six problems per year. To train a model that reliably solves IMO-level problems, you might need thousands of similar problems, each with verified correct solutions. That requires scientist-level annotators — people who actually understand competition mathematics at the highest level. DeepMind can hire those people. An open-source lab operating with fewer resources cannot.

Compare that to coding data, role-playing data, or homework-level math. The annotation pool is orders of magnitude larger. More people can verify the quality. The data itself is more abundant online. The cost-per-example drops dramatically.

The framework — For domains with abundant data (role-playing, general coding, everyday tasks), the bottleneck is evaluation — figuring out which of your million data points represents the best response. For domains with scarce data (Olympiad math, frontier science), the bottleneck is the query itself — you can’t even ask the right training questions without domain expertise.
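The framework can be made concrete with a toy calculation. The numbers below are entirely made up to show the shape of the tradeoff, not Zhipu's actual costs; the point is that scarcity flips which stage is the bottleneck and how much verified training signal a fixed budget buys.

```python
# Illustrative only: availability counts and verification costs are invented.
DOMAINS = {
    #                  examples obtainable,  cost to verify one example ($)
    "olympiad_math":  {"available": 240,       "verify_cost": 500.0},  # expert annotators
    "general_coding": {"available": 1_000_000, "verify_cost": 2.0},
    "role_play":      {"available": 5_000_000, "verify_cost": 0.5},
}

def bottleneck(domain: str, abundance_threshold: int = 10_000) -> str:
    """Scarce data: the bottleneck is generating the queries themselves
    (you need experts to even pose the problems). Abundant data: the
    bottleneck shifts to evaluation — ranking which responses are best."""
    d = DOMAINS[domain]
    return "query_generation" if d["available"] < abundance_threshold else "evaluation"

def examples_per_budget(domain: str, budget: float = 10_000.0) -> int:
    """Verified examples a fixed annotation budget buys in this domain."""
    d = DOMAINS[domain]
    return min(d["available"], int(budget / d["verify_cost"]))

for name in DOMAINS:
    print(f"{name}: bottleneck={bottleneck(name)}, "
          f"verified examples per $10k = {examples_per_budget(name)}")
```

Under these invented numbers, the same budget buys twenty verified Olympiad examples or thousands of coding ones, which is the "training signal per dollar" logic behind stopping where the economics stop.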

Zhipu allocates resources accordingly. They stop where the data economics stop making sense, and double down where they can generate the most training signal per dollar. This isn’t a capability limitation — it’s a strategic allocation decision. And Zixuan was refreshingly direct: “We cannot achieve AGI from simpler tasks. We have to do this someday.” But today, survival comes first.


5. Coding Is Not Just Coding

The Event — When asked where Zhipu sees the most user hunger, Zixuan kept returning to coding — but with a specific reframe.

The reframe — Coding, in Zhipu’s framing, isn’t a developer-only capability. Web coding means non-programmers building side projects. Tool use is a form of coding. Agent orchestration is a form of coding. The boundary between “coding” and “using tools to solve problems” has collapsed.

More importantly, Zixuan argued that strong coding ability is a prerequisite for strong tool use. A model that understands code structure, function calls, state management, and error handling is fundamentally better equipped to operate tools than one trained purely on natural language task completion. This is why Zhipu prioritizes coding capability even though their user base isn’t exclusively developers.

The competitive read — Zixuan explicitly identified Anthropic as their primary comparison — not OpenAI, not Google. The reasoning: Anthropic leads on coding and agentic tool use, which are the capabilities Zhipu cares about most. Their competitive frame is: can a GLM model combined with Aider, Cline, Kilo Code, or another coding agent match or beat the Claude + Claude Code combination? They’re tracking user reports of exactly these comparisons.


6. The GLM Coding Plan: Inverting the Model-Agent Relationship

The Event — Zhipu launched what they call the GLM Coding Plan, which inverts the traditional relationship between models and coding agents.

The traditional approach — Today, you pick a coding agent (Cursor, Claude Code, Cline) and then choose which model powers it. The agent evolves at its own pace; the model evolves at its own pace. Sometimes those paces diverge, creating compatibility gaps — a new model might not work well with an agent’s existing prompting structure, or an agent update might break what worked with a previous model version.

Zhipu’s inversion — With the GLM Coding Plan, you subscribe to the model first, then choose which coding agent to pair with it. Zhipu optimizes GLM for multiple agent frontends and tracks which combinations work best. They found that while ~7% of their users still default to Claude Code (running GLM through it), a vocal segment reports that GLM + Aider outperforms Claude + Claude Code for certain workflows.

Why this matters strategically — By making the model the anchor and the agent the variable, Zhipu positions itself as the platform layer rather than a commodity input. If your coding workflow is built around GLM’s strengths, switching to a different model means re-learning which agent works best, which prompting patterns are optimal, which failure modes to watch for. That’s real switching cost in a market where most model providers are interchangeable commodities.


7. Partnership as Infrastructure, Not Marketing

The Event — Zixuan explained Zhipu’s partnerships with Vercel, OpenRouter, Cline, and Copilot as infrastructure decisions rather than distribution plays.

The problem they’re solving — When Zhipu launches a model, they sometimes aren’t fully prepared for the inference load. They can’t go to every customer individually and negotiate concurrency limits (you get 5, you get 10, you get 20). Large platforms like Vercel and OpenRouter act as wholesale distribution — they handle the customer-facing service guarantees, rate limiting, and reliability that Zhipu would otherwise need to build in-house.

The deeper play — These partnerships also function as a real-world evaluation pipeline. By integrating into Manus, coding agents, and other production applications, Zhipu gets feedback on whether their synthetic training trajectories actually map to real user needs. They can observe: does GLM perform the task better than Claude in this specific integration? Where does it fall short? That signal is more valuable than any benchmark.

The resource constraint logic — Zixuan was explicit about ROI assessment on research partnerships too. They absorb knowledge from papers constantly, but co-creating research requires resource allocation they can’t always justify. The calculus: is it more efficient to learn from others’ published work, or to co-create? For a resource-constrained lab, learning from others and focusing engineering effort on integration often wins.


8. The Transformer Ceiling and What Comes After

The Event — In a candid moment, Zixuan acknowledged that even with the right data and the right training, there’s a ceiling above them — and it might be architectural.

What he said — “We have to understand the limitation of transformers, the limitation of the architecture. Even when we want to do something, even when we have the data, sometimes there’s a ceiling above us.” The path forward requires figuring out whether the ceiling is broken by higher quality data, new architectures, or something else entirely.

The experimentation approach — Zhipu doesn’t just experiment with their own models. They test hypotheses against Gemini, DeepSeek, and other frontier models to understand whether observed limitations are specific to GLM or fundamental to the transformer paradigm. This cross-model experimentation is how they decide where to invest in architecture research versus where to optimize within existing constraints.

The reinforcement learning tension — Zixuan sees RL as useful for pushing from 95% to 100% on specific capabilities, but not for bridging the gap from 30% to 100%. DeepSeek is doing this well, in his view. But for Zhipu, the priority remains: data for fine-tuning is still the most important lever. RL is a finishing tool, not a foundation.


9. “What I Said May Not Be True Next Month”

The Closer — Zixuan ended with something that should be taken seriously: the pace of change means any strategy articulated today might be obsolete in weeks. Zhipu might not be a language model company next year — they could be a robot company, or something else entirely. The constant is the mission: chasing AGI and building intelligence that serves human needs.

The honest version of Zhipu’s position: they are a resource-constrained lab making calculated bets in an environment where the frontier moves weekly. They survive by being ruthlessly practical about what their users need, transparent about what they can’t do, and fast enough to adapt when the landscape shifts.

The competitive moat isn’t the model. It’s the feedback loop between user behavior, training data, and deployment infrastructure that lets them ship a “good enough” model faster and cheaper than anyone else.


Subscribe to support AI Made Simple and help us deliver more quality information to you-


Flexible pricing available—pay what matches your budget here.

Thank you for being here, and I hope you have a wonderful day.

Dev <3

If you liked this article and wish to share it, please refer to the following guidelines.


That is it for this piece. I appreciate your time. As always, if you’re interested in working with me or checking out my other work, my links will be at the end of this email/post. And if you found value in this write-up, I would appreciate you sharing it with more people. It is word-of-mouth referrals like yours that help me grow. The best way to share testimonials is to share articles and tag me in your post so I can see/share it.

Reach out to me

Use the links below to check out my other content, learn more about tutoring, reach out to me about projects, or just to say hi.

Small Snippets about Tech, AI and Machine Learning over here

AI Newsletter- https://artificialintelligencemadesimple.substack.com/

My grandma’s favorite Tech Newsletter- https://codinginterviewsmadesimple.substack.com/

My (imaginary) sister’s favorite MLOps Podcast-

Check out my other articles on Medium:

https://machine-learning-made-simple.medium.com/

My YouTube: https://www.youtube.com/@ChocolateMilkCultLeader/

Reach out to me on LinkedIn. Let’s connect: https://www.linkedin.com/in/devansh-devansh-516004168/

My Instagram: https://www.instagram.com/iseethings404/

My Twitter: https://twitter.com/Machine01776819
