
Most Important AI Updates of the week. Feb 16th 2026-Feb 22 2026 [Livestreams]

Why AI benchmarks are failing, and more

It takes time to create work that’s clear, independent, and genuinely useful. If you’ve found value in this newsletter, consider becoming a paid subscriber. It helps me dive deeper into research, reach more people, stay free from ads/hidden agendas, and supports my crippling chocolate milk addiction. We run on a “pay what you can” model—so if you believe in the mission, there’s likely a plan that fits (over here).

Every subscription helps me stay independent, avoid clickbait, and focus on depth over noise, and I deeply appreciate everyone who chooses to support our cult.

Help me buy chocolate milk

PS – Supporting this work doesn’t have to come out of your pocket. If you read this as part of your professional development, you can use this email template to request reimbursement for your subscription.

Every month, the Chocolate Milk Cult reaches over a million Builders, Investors, Policy Makers, Leaders, and more. If you’d like to meet other members of our community, please fill out this contact form here (I will never sell your data nor will I make intros w/o your explicit permission)- https://forms.gle/Pi1pGLuS1FmzXoLr6


Thanks to everyone for showing up to the livestream. Mark your calendars for 8 PM EST, Sundays, to make sure you can come in live and ask questions.

Bring your moms and grandmoms into the Chocolate Milk Cult.


As usual, we have a new foster cat that’s ready to be adopted. I call him Chipku (Hindi for clingy; his government name is Jancy), and as you might guess, he’s very affectionate. I’ve trained him to be better around animals and strangers, and he’s perfect for families that already have some experience with cats. We sleep together every day, and waking up to him is one of the nicest feelings. If you’re around New York City, adopt him here (or share this listing with someone who might be interested).

Community Spotlight: RLMs

Recursive Language Models (RLMs) are a task-agnostic inference paradigm for language models (LMs) to handle near-infinite length contexts by enabling the LM to programmatically examine, decompose, and recursively call itself over its input. RLMs replace the canonical llm.completion(prompt, model) call with a rlm.completion(prompt, model) call. RLMs offload the context as a variable in a REPL environment that the LM can interact with and launch sub-LM calls inside of. It’s a super interesting idea and I’d suggest playing with the concept yourself. You can find the GitHub here.
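To make the idea concrete, here’s a minimal toy sketch of the RLM pattern. Everything here is hypothetical: `llm_completion` is a stand-in for a real model call, and the fixed-size chunking is just one illustrative decomposition strategy — the actual repo’s API and recursion scheme will differ.

```python
# Toy sketch of the RLM pattern (hypothetical API; the real repo differs).
# The long prompt lives in a program variable the "root" model can slice,
# rather than being stuffed into a single context window.

def llm_completion(prompt: str) -> str:
    """Stand-in for a real llm.completion(prompt, model) call.
    Toy behaviour: 'summarise' by returning the first sentence."""
    return prompt.split(".")[0] + "."

def rlm_completion(prompt: str, chunk_size: int = 200) -> str:
    """Recursively decompose a long prompt instead of passing it whole."""
    if len(prompt) <= chunk_size:
        # Base case: short enough to hand to the underlying LM directly.
        return llm_completion(prompt)
    # Decompose the offloaded context into chunks...
    chunks = [prompt[i:i + chunk_size] for i in range(0, len(prompt), chunk_size)]
    # ...launch a sub-LM call per chunk, then recurse over the merged partials.
    partials = [rlm_completion(c, chunk_size) for c in chunks]
    return rlm_completion(" ".join(partials), chunk_size)

long_context = "First fact. " * 100  # ~1,200 chars, far past chunk_size
print(rlm_completion(long_context))
```

The key property the sketch tries to show: the model’s effective context is bounded by the chunk size at every call, while the full input can be arbitrarily long because it lives in the environment, not the prompt.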

If you’re doing interesting work and would like to be featured in the spotlight section, just drop your introduction in the comments/by reaching out to me. There are no rules- you could talk about a paper you’ve written, an interesting project you’ve worked on, some personal challenge you’re working on, ask me to promote your company/product, or anything else you consider important. The goal is to get to know you better, and possibly connect you with interesting people in our chocolate milk cult. No costs/obligations are attached.

Additional Recommendations (not in Livestream)

  • the watchers: how openai, the US government, and persona built an identity surveillance machine that files reports on you to the feds: Researchers discovered publicly accessible source maps tied to identity verification infrastructure from Persona, a vendor used by OpenAI. Those source maps referenced internal modules related to watchlist screening, politically exposed person checks, risk scoring, and reporting pathways connected to agencies like FinCEN and FINTRAC. What this confirms is that modern identity verification stacks often include full compliance tooling—not just photo matching. I didn’t cover this on the live because we don’t have confirmation yet on how bad this is, but for now it’s worth noting how deep these systems are getting. Will keep you updated as the story develops.

  • The Man Too Pathetic to Punish (Ft. Value Select!): Raynald de Chatillon was a broke French knight who built a pirate fleet in the middle of a desert and may have tried to dig up the Prophet Muhammad’s bones as a tourist attraction. And everyone let him go because his cringe was too powerful.

  • Rubric-Based Rewards for RL: Cameron R. Wolfe, Ph.D. publishes, I share his posts. I’m still not done with the article but lots of great takeaways.

  • The Molecular Structure of Thought: Mapping the Topology of Long Chain-of-Thought Reasoning: Large language models (LLMs) often fail to learn effective long chain-of-thought (Long CoT) reasoning by imitating humans or non-Long-CoT LLMs. To understand this, we propose that effective and learnable Long CoT trajectories feature stable molecular-like structures in a unified view, which are formed by three interaction types: Deep-Reasoning (covalent-like), Self-Reflection (hydrogen-bond-like), and Self-Exploration (van der Waals-like). Analysis of distilled trajectories reveals these structures emerge from Long CoT fine-tuning, not keyword imitation. We introduce Effective Semantic Isomers and show that only bonds promoting fast entropy convergence support stable Long CoT learning, while structural competition impairs training. Drawing on these findings, we present Mole-Syn, a distribution-transfer-graph method that guides synthesis of effective Long CoT structures, boosting performance and RL stability across benchmarks.

  • White-Collar Apocalypse Isn’t Around the Corner—But AI Has Already Fundamentally Changed the Economy: Incredible analysis of the impacts of AI by one of the world’s smartest AI commentators. We’re all very lucky that James Wang shares his insights free of cost, enjoy it while it lasts folks.

  • Part I: The Gala, the Suburbs, and the “Months Behind” Myth in LLM Labs: Great analysis of the Chinese ecosystem by Grace Shao. My question from it— We often see Chinese labs introduce interesting architectural innovations — for example, Kimi’s MuonClip optimizer or DeepSeek’s hyperconnection approach. In these cases, the improvements don’t primarily come from pre-training scale or post-training techniques. They’re more algorithmic or architectural breakthroughs. How do you think these kinds of innovations factor into the overall trajectory? Are they central drivers of progress, or more like exceptions to the broader trend?

Companion Guide to the Livestream

This guide expands the core ideas from the stream and structures them for deeper reflection. Most of you are reading this rather than watching — so the goal here is to make sure you get everything. Watch the full stream for tone, tangents, and the cat cameo at the end.

1. Anthropic Stopped Trying to Win the Intelligence Race — And That Might Be the Smartest Move in the Market

What happened

Sonnet 4.6 dropped this week. Both Sonnet 4.6 and Opus 4.6 now carry a 1 million token context window in beta — and most coverage treated that as the headline. It isn’t. The headline is what that context window is actually being built for, and what it tells us about where Anthropic is placing its bet going into the rest of 2026.

Why this matters

Cast your mind back 18 months. If you were building a serious production AI system, you wouldn’t touch Anthropic for agentic work. Claude was arguably the most intelligent model in the market — the reasoning was sharp, the outputs were nuanced — but the moment you gave it a system prompt with 50 rules and a stack of function calls, it fell apart. Instruction following was unreliable. Multi-step tool orchestration was inconsistent. It wasn’t built for that kind of work.

OpenAI’s GPT-4o and GPT-4.1, despite being arguably less capable at pure reasoning, dominated agentic workflows. Hand them a large ruleset and say “call these functions in this order, follow these constraints,” and they would do it. Reliably. Repeatedly. That made them the default backbone for anyone building serious AI systems, even among teams that would have preferred Anthropic’s reasoning quality.

What happened around December changed the picture. Anthropic released an Opus update that had clearly been trained on synthetic agent chains — long, complex sequences of tool calls with conditional branching. The model learned to call a tool, evaluate what came back, and dynamically revise its next move based on the output. Claude Code became the “oh shit” moment for the developer community not because it was suddenly smarter, but because it could operate. It could navigate real environments, handle unexpected outputs, and maintain coherent intent across dozens of sequential tool calls without losing the thread.

The architecture Anthropic is likely building toward

The 1 million token context window in both Sonnet and Opus is a direct extension of that same bet, and it’s worth being precise about what it’s actually for. The assumption most people make is that large context windows are for humans who want to dump codebases into a chat and ask questions. That’s not the primary use case Anthropic is building for. They’re building for agents that need to read full debug logs, trace complete agent trajectories, and hold the state of an entire repository in memory across a long task. The context isn’t there for retrieval — it’s there for operational awareness.

Watch the divergence between Sonnet and Opus carefully. The architectural split is older than this release — Sonnet actually had the 1M context window before Opus did in earlier generations, which is itself a signal worth sitting with. Post-3.5 is where this divergence really started to show up, and both models carry it now in 4.6. The hypothesis here — speculative, but directionally supported by everything Anthropic has shipped — is that Sonnet is being optimised as an orchestrator and Opus as an executor. An orchestrator needs to hold a meta-view of an entire system: what has been done, what needs to happen next, where the problems are. That’s why you need the large context window on the orchestrator. Opus then comes in with high-capability execution on specific subtasks. You don’t need a million-token context window to fix a specific bug once someone tells you exactly where it is.

This maps to a broader shift in how AI systems are being architected. The value is no longer in having one genius model that does everything. It’s in models that know their role, stay in their lane, and communicate well with each other. The vocabulary of “orchestrators” and “executors” is going to become standard design language over the next 12 months.
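The orchestrator/executor split can be sketched as a toy control loop. To be clear about assumptions: `call_sonnet` and `call_opus` below are hypothetical stand-ins, not Anthropic API methods, and the “repo state” is a fake string — the point is the division of context, not the SDK.

```python
# Toy sketch of the hypothesised orchestrator/executor split.
# call_sonnet / call_opus are invented stand-ins, NOT real API calls.

def call_sonnet(full_repo_state: str) -> list[str]:
    """Orchestrator: sees the whole system state, emits narrow subtasks.
    This is the role that needs the huge context window."""
    # Toy planner: one subtask per line that mentions a failing test.
    return [line for line in full_repo_state.splitlines() if "FAIL" in line]

def call_opus(subtask: str) -> str:
    """Executor: high-capability work on one small, well-scoped task.
    It never sees the full repo, only the slice it must act on."""
    # Toy fix: mark the failing item as handled.
    return subtask.replace("FAIL", "FIXED")

def run_agent_system(full_repo_state: str) -> list[str]:
    subtasks = call_sonnet(full_repo_state)
    return [call_opus(t) for t in subtasks]

repo = "tests/test_auth.py FAIL\ntests/test_db.py PASS\ntests/test_api.py FAIL"
print(run_agent_system(repo))
```

The design choice the sketch encodes: the orchestrator holds the meta-view and stays cheap per decision, while the expensive executor is only invoked with tightly scoped inputs — which is exactly why the large context window belongs on the orchestrator side.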

The implication for builders

Anthropic is positioning to be the dominant choice for orchestration layers in serious agentic systems. Not because Claude is the most intelligent model — GPT-5.2 still holds that crown for pure reasoning tasks — but because Anthropic has done the specific work to make a model that understands how to operate within a system. That turns out to be the more valuable capability when you’re building at scale. Sonnet 4.6’s early evals back this up: developers preferred it over Sonnet 4.5 70% of the time in Claude Code testing, and even preferred it over Opus 4.5 in 59% of comparisons. The intelligence gap is narrowing. The operational gap has already closed.

Read about Weka, the startup that’s betting on the growth of context and KV caches from this agentic explosion here—

Read about the trends we can infer from the demand for AI models here—

2. What Is Actually Wrong with Gemini 3.1 (And Why the Problems Are Structural, Not Fixable with a Model Update)

What happened

Gemini 3.1 Pro dropped on February 19th. On paper the numbers look compelling — 77.1% on ARC-AGI-2, which is more than double Gemini 3 Pro’s score, and a #1 ranking on Artificial Analysis’s intelligence index at time of publication. Google is presenting this as a major step forward in reasoning. The benchmark story is genuinely impressive. The problem is that benchmarks are exactly the thing you should be sceptical of here, and there are structural reasons — in how the model was trained and in how DeepMind runs its research culture — that make this more than a model quality debate.

The training pipeline problem

To understand what went wrong, you need to understand how modern large language models are trained. Pre-training is where a model builds its fundamental world model — it ingests vast quantities of data, learns associations, and develops its base understanding of how information relates to information. This is the phase that sets the raw intelligence ceiling. Post-training then fine-tunes behaviour, aligns the model with human preferences, and shapes how it responds in practice.

Historically, model intelligence came primarily from pre-training. Benchmarks tracked this reasonably well because they were designed to test generalised capability. The smarter your base model, the better your benchmark performance. That relationship has started to break down. Pre-training data has largely converged across labs — everyone is drawing from similar sources, similar volumes, similar techniques. The differentiation at pre-training has narrowed significantly. So labs started treating post-training as the primary lever for benchmark performance.

Here’s where DeepMind made a critical error. They invested heavily in post-training aimed specifically at benchmark scores. The mechanism is straightforward: build synthetic datasets that mirror the benchmark distribution as closely as possible, and train the model aggressively against those distributions. The model gets very good at those benchmarks. The press release looks great.

But this technique has a well-documented failure mode. When you train a model to optimise against a specific distribution — and especially when you use reinforcement learning to do it — the model learns to game that distribution rather than develop the underlying capability the benchmark was designed to measure. RL is particularly aggressive at this. Its entire training objective is to maximise a reward signal, and the most efficient path to doing that is almost always to find a shortcut rather than to actually solve the problem. You end up with a model that has learned the texture of benchmark questions rather than the underlying reasoning. The moment real tasks diverge from that texture — agentic workflows, complex instruction chains, genuinely novel problems — the shortcuts stop working and the model behaves erratically.

This is the core prediction: the benchmark numbers won’t translate to a Gemini CLI comeback, and Gemini 3.1 is not going to be the developer adoption story Google is hoping for. The model was never unintelligent to begin with. Gemini’s base capability has always been strong. What happened is that a fundamentally capable model got trained into a very narrow region of competence. You can’t undo that by scaling — you have to retrain with a different objective.
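The failure mode is easiest to see in caricature. The toy below is not a claim about how Gemini was actually trained — it’s the degenerate limit of optimising against a fixed evaluation distribution: a “model” that memorises the benchmark’s surface form scores perfectly in-distribution and collapses the moment the same questions are paraphrased.

```python
# Caricature of benchmark gaming: memorise the eval's surface form.
# Perfect in-distribution score, total collapse on paraphrases of the
# SAME underlying questions.

benchmark = {
    "what is 2 + 2": "4",
    "capital of france": "paris",
    "largest planet": "jupiter",
}

def gamed_model(question: str) -> str:
    # "Post-training" here is pure memorisation of the eval distribution.
    return benchmark.get(question.lower().strip(), "i don't know")

# In-distribution: perfect score.
in_dist = sum(gamed_model(q) == a for q, a in benchmark.items()) / len(benchmark)

# Off-distribution: new surface form, identical underlying capability needed.
paraphrased = {
    "what do you get when you add 2 and 2": "4",
    "which city is france's capital": "paris",
    "which planet is the biggest": "jupiter",
}
off_dist = sum(gamed_model(q) == a for q, a in paraphrased.items()) / len(paraphrased)

print(in_dist, off_dist)
```

Real models fail far more gradually than a lookup table, but the direction is the same: the tighter the optimisation target hugs the benchmark distribution, the steeper the drop-off once real tasks diverge from that texture.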

Why this keeps happening at large labs

The research culture problem is worth understanding clearly because it isn’t unique to DeepMind. It’s a structural feature of how large AI labs manage research investment under uncertainty. Research on fundamental improvements — novel architectures, new training paradigms, mathematical innovations — has a specific and uncomfortable property: you cannot predict in advance whether it will work. If you’re a researcher at a major lab and you want to pursue something genuinely novel, you have to go to leadership and say “give me resources and time, and I genuinely don’t know if this will produce anything.” That is a very difficult sell when the people making budget decisions are generalists under pressure to show consistent progress.

Post-training benchmark optimisation, by contrast, offers highly predictable returns. “Give me X compute and X synthetic data and I’ll get you Y improvement on this evaluation.” Scaling laws became beloved in research organisations precisely because they turned AI research into a capital allocation problem — more money, predictable performance curve. You could walk into a budget meeting with a graph and a number. Fundamental research can’t give you that.

The result is a systematic selection effect: the research that gets funded is the research that can demonstrate near-term measurable returns. The researchers who stay and get promoted are the ones who are good at navigating that system. The people with the highest appetite for genuine paradigm-shifting work tend to hit walls, get frustrated, and leave to start their own things or join smaller shops.

At a company like Google this effect is compounded by organisational scale. Research direction decisions are being made several levels removed from the actual work. The sunk cost bias toward already-funded projects over scrappy new ones with no track record is well established. You end up pouring resources into benchmark optimisation not because it’s the best strategy, but because it’s the safest career move for everyone in the decision chain. What you’re left with is a very expensive model that performs well in a very specific setting and struggles everywhere else.

The falsifiable call

There will be noise in the coming weeks about Gemini 3.1 driving a CLI comeback and a resurgence in developer adoption. It won’t materialise — not the way Claude Code happened in December, not the way Codex is starting to get traction. The Gemini Flash and Flash-Lite models are genuinely good, interestingly for the inverse reason: when you over-optimise the main model and then run aggressive regularisation to distill a smaller version, you sometimes get something that works well in its narrower operating range. But the flagship intelligence at the agentic and complex instruction-following level is not where it needs to be. You can verify this yourself. Take it off the benchmark distribution and see what happens.

Why RL is overrated—

3. GPT-5.3 Codex and Codex-Spark: Reading What OpenAI Is Actually Telling You

What happened

OpenAI dropped two things this week. GPT-5.3-Codex launched on February 12th — a full model update, their most capable agentic coding model to date, trained with and served on NVIDIA GB200 NVL72 systems. Then came Codex-Spark: a smaller, faster version running not on NVIDIA hardware but on Cerebras’s Wafer-Scale Engine 3. Codex-Spark launched inside the Codex app — which at release was Mac-only, meaning a lot of people (Windows users included) didn’t get access until much later. It’s delivering over 1,000 tokens per second, roughly 15x faster than the standard Codex model.

What Codex-Spark is actually telling you

The model quality question for GPT-5.3 Codex is genuinely difficult to answer right now. It doesn’t obviously outperform GPT-5.2 on every task, and separating model improvements from infrastructure improvements — they also ran GPT-5.3 Codex 25% faster through inference stack upgrades — makes direct comparison hard. That’s fine. The more interesting signal is the infrastructure story. OpenAI partnered with Cerebras in January in a multi-year deal worth over $10 billion. Codex-Spark is the first concrete output of that partnership and the first OpenAI model not running on NVIDIA hardware. Sam Altman publicly called NVIDIA “the best chip makers in the world” and described the relationship as foundational — that’s the public line, and of course it’s the public line. Jensen’s not happy about this. The strategic intent is clear enough: diversifying inference infrastructure away from near-total GPU dependency reduces supply chain risk and creates negotiating leverage. Cerebras’s Wafer-Scale Engine is built for inference in a way that GPU clusters aren’t — the entire model lives on a single wafer of silicon, eliminating the inter-chip communication latency that slows GPU clusters down. That architecture is specifically suited to the low-latency, high-throughput demands of agentic workflows.

The bigger strategic read

The Cerebras move, combined with the strong inference speed focus, maps to a specific strategic trajectory for OpenAI. They appear to be hitting a ceiling on raw intelligence gains and shifting significant resources toward deployment economics — faster inference, lower cost per token, eventually the ad-supported tier for free users. That’s the rational move for a company on an IPO trajectory that needs to demonstrate a credible path to profitability. Intelligence is a research problem. Margins are a business problem. OpenAI is increasingly focused on the business problem. There’s also a competitive rebalancing worth acknowledging. Eighteen months ago, OpenAI had the best models for agentic work. Anthropic has taken that ground. OpenAI still holds the lead for pure intelligence tasks — if you want the most accurate analysis of a genuinely ambiguous problem, GPT-5.2 is still the benchmark. But agentic workflows, complex tool chains, software development in real environments — that’s Anthropic’s territory now. The two labs have effectively traded positions, and both are consolidating into their new ground.

We first broke down these hardware-market trends in our deep-dive into why GPUs were no longer the one-size-fits-all answer, and why inference was opening a new market here (which is exactly what we see with Cerebras and OpenAI vs NVIDIA).

4. NVIDIA Blackwell and the Meta Deal: The Hardware Layer Catches Up

What happened

Two hardware stories worth tracking together. Semi-Analysis published throughput numbers for NVIDIA’s Blackwell Ultra: approximately 50x better throughput and 35x lower cost per token versus the Hopper generation. Separately, Meta signed a multi-year agreement with NVIDIA covering both GPUs and CPUs.

On the Blackwell numbers

Semi-Analysis does genuinely high quality technical work. They’re also no longer fully independent — institutional relationships and access considerations create soft incentives to present company narratives charitably. Treat the 50x and 35x figures as directional upper bounds, not engineering specs. Even a fraction of those improvements would still represent a major generational step in inference economics. The more interesting detail is what Semi-Analysis flagged as the primary design targets: large context windows and agentic workflows. Blackwell’s memory architecture and bandwidth improvements are specifically suited to the workloads that both Anthropic and OpenAI are building toward — long context chains, multi-agent communication, iterative tool calls. This is not a coincidence. The hardware layer is aligning behind the same architectural thesis as the model layer.
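To put the directional figures in serving terms, here’s a back-of-envelope sketch. The Hopper baseline numbers below are illustrative placeholders, not actual hardware specs — the only inputs taken from the reporting are the 50x throughput and 35x cost-per-token multipliers.

```python
# Back-of-envelope on the (directional) Blackwell Ultra figures.
# Baseline values are ILLUSTRATIVE, not real Hopper specs; only the
# 50x / 35x multipliers come from the Semi-Analysis reporting.

hopper_tokens_per_sec = 1_000   # illustrative per-deployment baseline
hopper_cost_per_mtok = 3.50     # illustrative $ per 1M tokens

blackwell_tokens_per_sec = hopper_tokens_per_sec * 50   # 50x throughput
blackwell_cost_per_mtok = hopper_cost_per_mtok / 35     # 35x cheaper/token

print(f"throughput: {blackwell_tokens_per_sec:,} tok/s")
print(f"cost: ${blackwell_cost_per_mtok:.2f} per 1M tokens")
```

Even if the real gains are a fraction of the headline multipliers, the shape of the change is the point: cost per token falling by an order of magnitude is what makes long-context, many-call agentic workloads economically viable.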

On the Meta-NVIDIA deal

The CPU inclusion is the interesting part. NVIDIA is using CPUs as an ecosystem lock-in mechanism — take the GPU deal, take the CPU package, and now your entire stack is NVIDIA-native. Your kernels are optimised for their memory hierarchy, your networking is InfiniBand, your switching costs become substantial. AMD and Intel have been trying to build an open-standards alternative to CUDA for years, partly through CPU-GPU integration plays. Getting Meta into the full NVIDIA ecosystem is a direct counter-move to that effort. Google was reportedly in conversations with Meta about TPUs and didn’t close — very Google. The TPU ecosystem remains capable, particularly for inference, but the kernel support and integration burden is real, and NVIDIA’s full-stack offer won out. Meta has the money and the strategic rationale to be investing heavily in custom silicon for their longer-term VR and consumer hardware ambitions, but the people making these decisions at the top of large organisations tend not to be the most risk-tolerant. The NVIDIA deal is the safer play. Whether it’s the better one is a different question.

5. The Research Horizon: Predicting Model Properties Before You Train

What’s being worked on

Can you mathematically derive the properties of a trained model from the geometric structure of the model before training it at all? Not just “more compute, better performance” — but genuinely predicting which capabilities will emerge, how the model will behave on classification tasks, what its ceiling looks like, from the architecture itself.

Why this is a hard and worthwhile problem

The current paradigm for understanding AI capability is empirical. You train, you evaluate, you observe. Scaling laws gave us predictability in a narrow sense — more compute predicts better benchmark performance — but they don’t tell you anything about which capabilities emerge, when they emerge, or what the model will fail at. The training run is the experiment, and the experiment is expensive.

What this research is probing is whether the geometry of a model’s weight space — the mathematical structure of how information is encoded, how different regions of the latent space relate to each other, how basins of attraction are shaped — contains information about capability before training. If it does, you can do pre-training analysis. You can fail faster, iterate on design rather than on training runs, and potentially understand why models with certain structural properties are better or worse at certain classes of tasks.

This is high-variance research. It might not work. The prize if it does is significant, and it’s exactly the kind of work that large labs are structurally least able to fund for the reasons outlined in the earlier section.
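As a purely speculative illustration of what “pre-training analysis” could look like mechanically, here’s one geometric statistic you can compute on an untrained layer: the effective rank of its singular-value spectrum. The premise that statistics like this predict downstream capability is the open research question above, not an established result, and this particular diagnostic is my own illustrative choice.

```python
import numpy as np

# Speculative sketch: a geometric statistic of UNTRAINED weights.
# Whether such statistics predict trained-model capability is the open
# research question -- this is an illustration of the workflow, not a method.

rng = np.random.default_rng(42)

def effective_rank(weights: np.ndarray) -> float:
    """exp(entropy) of the normalised singular-value spectrum:
    one common proxy for how 'spread out' a layer's geometry is."""
    s = np.linalg.svd(weights, compute_uv=False)
    s = s[s > 1e-12 * s[0]]   # drop numerical zeros
    p = s / s.sum()
    return float(np.exp(-np.sum(p * np.log(p))))

# Two candidate initialisations for the same layer shape.
gaussian_init = rng.normal(size=(64, 64)) / np.sqrt(64)
low_rank_init = rng.normal(size=(64, 2)) @ rng.normal(size=(2, 64))

print(f"gaussian init effective rank: {effective_rank(gaussian_init):.1f}")
print(f"low-rank init effective rank: {effective_rank(low_rank_init):.1f}")
```

The appeal of the research direction is exactly this shape of workflow: statistics like these cost seconds to compute, so if any of them turn out to correlate with trained behaviour, you can reject bad designs before paying for a training run.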

This is an extension of our earlier work on Fractal Embeddings, shared here—


Subscribe to support AI Made Simple and help us deliver more quality information to you-

Flexible pricing available—pay what matches your budget here.

Thank you for being here, and I hope you have a wonderful day.

Dev <3

If you liked this article and wish to share it, please refer to the following guidelines.


That is it for this piece. I appreciate your time. As always, if you’re interested in working with me or checking out my other work, my links will be at the end of this email/post. And if you found value in this write-up, I would appreciate you sharing it with more people. It is word-of-mouth referrals like yours that help me grow. The best way to share testimonials is to share articles and tag me in your post so I can see/share it.

Reach out to me

Use the links below to check out my other content, learn more about tutoring, reach out to me about projects, or just to say hi.

Small Snippets about Tech, AI and Machine Learning over here

AI Newsletter- https://artificialintelligencemadesimple.substack.com/

My grandma’s favorite Tech Newsletter- https://codinginterviewsmadesimple.substack.com/

My (imaginary) sister’s favorite MLOps Podcast-

Check out my other articles on Medium:

https://machine-learning-made-simple.medium.com/

My YouTube: https://www.youtube.com/@ChocolateMilkCultLeader/

Reach out to me on LinkedIn. Let’s connect: https://www.linkedin.com/in/devansh-devansh-516004168/

My Instagram: https://www.instagram.com/iseethings404/

My Twitter: https://twitter.com/Machine01776819
