IBM Granite and The Small Model Thesis [Livestream]

IBM's Kate Soule on why starting with frontier models locks you into losing architectures

It takes time to create work that’s clear, independent, and genuinely useful. If you’ve found value in this newsletter, consider becoming a paid subscriber. It helps me dive deeper into research, reach more people, stay free from ads/hidden agendas, and supports my crippling chocolate milk addiction. We run on a “pay what you can” model—so if you believe in the mission, there’s likely a plan that fits (over here).

Every subscription helps me stay independent, avoid clickbait, and focus on depth over noise, and I deeply appreciate everyone who chooses to support our cult.

Help me buy chocolate milk

PS – Supporting this work doesn’t have to come out of your pocket. If you read this as part of your professional development, you can use this email template to request reimbursement for your subscription.

Every month, the Chocolate Milk Cult reaches over a million Builders, Investors, Policy Makers, Leaders, and more. If you’d like to meet other members of our community, please fill out this contact form here (I will never sell your data nor will I make intros w/o your explicit permission)- https://forms.gle/Pi1pGLuS1FmzXoLr6


Community Spotlight:

If you’re doing interesting work and would like to be featured in the spotlight section, just drop your introduction in the comments or by reaching out to me directly. There are no rules: you could talk about a paper you’ve written, an interesting project you’ve worked on, some personal challenge you’re working on, ask me to promote your company/product, or anything else you consider important. The goal is to get to know you better, and possibly connect you with interesting people in our chocolate milk cult. No costs/obligations are attached.

Companion Guide to the Livestream: IBM’s Kate Soule on Small Model Strategy

This guide expands the core ideas and structures them for deeper reflection. Watch the full stream for tone, nuance, and side-commentary.


1. The Small Model First Paradigm

The Event — Kate Soule, Product Director at IBM Research leading the Granite program, laid out IBM’s thesis: the mental model distinguishing “open weight vs closed” is less useful than asking “edge vs non-edge.” The real question isn’t who controls the weights—it’s whether you can bring the model to where the data lives.

Why this reframes everything — Most AI architecture decisions get locked in during the POC phase. You swipe a credit card, hit a frontier endpoint, build something that works. Then production arrives, and 80% of projects die on cost and the other 20% on licensing/security. The issue isn’t that the POC failed—it’s that you architected around assumptions that don’t survive contact with scale.

Kate’s argument: if you start with frontier models, you inherit their constraints. One expensive inference request, cram everything into it, pray for a good output. But if you start with small models, entirely different design patterns emerge. Run 100 inferences at a fraction of the cost, pick the best one. That’s not an optimization of the frontier approach—it’s a different paradigm you can’t retrofit.
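The "run many cheap inferences, pick the best" pattern is essentially best-of-N sampling with a verifier. A minimal sketch of the shape (the `generate` stub and its scoring are hypothetical stand-ins, not any IBM API):

```python
import random

def generate(prompt: str) -> tuple[str, float]:
    """Stand-in for one cheap small-model inference.

    Returns a candidate answer plus a score from some verifier
    (e.g. a reward model or a programmatic constraint check).
    """
    answer = f"candidate-{random.randint(0, 9999)}"
    score = random.random()  # hypothetical verifier score in [0, 1]
    return answer, score

def best_of_n(prompt: str, n: int = 100) -> str:
    """Run n cheap inferences and keep the highest-scoring candidate,
    instead of betting everything on one expensive frontier call."""
    candidates = [generate(prompt) for _ in range(n)]
    answer, _ = max(candidates, key=lambda c: c[1])
    return answer

print(best_of_n("Summarize the quarterly report", n=100))
```

The point of the sketch: this design only pencils out when a single inference is cheap, which is why it can't be retrofitted onto a frontier-endpoint architecture.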

Insight — The question isn’t “can small models match frontier performance?” It’s “what systems would we design if we assumed abundant cheap inference instead of scarce expensive inference?” IBM is betting those systems look fundamentally different.


2. Granite 4’s Hybrid Architecture

The Event — IBM released Granite 4 with models ranging from 350M to 32B parameters. The interesting technical decision: hybrid architectures combining Mamba 2 layers with traditional full attention layers.

The engineering logic — This isn’t about accuracy. Mamba layers reduce memory consumption on long-context tasks by escaping quadratic attention bottlenecks. As context windows expand for agentic workloads, memory becomes the binding constraint before compute does. The hybrid approach lets you run production workloads on cheaper hardware at longer contexts.
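A back-of-envelope sketch of why memory binds first. The model dimensions below are hypothetical (not Granite 4's actual configuration), and the Mamba state size is simplified, but the shape of the comparison holds: an attention layer's KV cache grows linearly with context length, while an SSM layer carries a fixed-size recurrent state regardless of context.

```python
def kv_cache_bytes(ctx_len, n_layers, n_kv_heads, head_dim, bytes_per=2):
    """KV cache for full-attention layers: 2 tensors (K and V),
    each ctx_len x n_kv_heads x head_dim per layer, in FP16."""
    return 2 * ctx_len * n_layers * n_kv_heads * head_dim * bytes_per

def mamba_state_bytes(n_layers, d_state, d_model, bytes_per=2):
    """Simplified SSM recurrent state: fixed size, independent of
    context length (expansion factors and conv state ignored)."""
    return n_layers * d_state * d_model * bytes_per

# Hypothetical 32-layer model at a 128K-token context:
kv = kv_cache_bytes(ctx_len=131072, n_layers=32, n_kv_heads=8, head_dim=128)
ssm = mamba_state_bytes(n_layers=32, d_state=128, d_model=4096)
print(f"attention KV cache: {kv / 1e9:.1f} GB")   # grows with context
print(f"SSM state:          {ssm / 1e6:.1f} MB")  # constant with context
```

Double the context and the KV cache doubles; the SSM state doesn't move. Replace most attention layers with Mamba layers and the long-context memory bill shrinks accordingly, which is what lets the hybrid run on cheaper hardware.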

Kate’s number that matters: their 350M parameter model today matches 30B parameter performance from three years ago. The driver isn’t architectural magic—it’s post-training refinement, RL techniques, and training on 20T+ tokens instead of 1T. The scaling laws shifted from “bigger models” to “more data, better post-training, same size.”

Insight — The small model gains aren’t asymptoting. IBM is betting efficiency improvements continue, which means the cost curve keeps bending. If you’re architecting systems assuming current small model capabilities, you’re already behind.


3. The Inference Scaling Bet

The Event — Kate described IBM Research’s focus on what they call “inference scaling”—the idea that you can trade test-time compute for training-time compute. Instead of investing everything upfront in a massive model, run adaptive inference loops on smaller models.

What this looks like in practice — Mellea (M-E-L-L-E-A), IBM’s open-source framework, embodies this philosophy. The core principle: don’t use LLMs when you don’t need them. Alphabetizing a list? That’s been a solved problem since before neural networks. The framework helps you build constraint-checking loops, run multiple inferences until thresholds are met, and route only the genuinely language-dependent tasks to the model.
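The routing principle can be sketched in a few lines. This is not Mellea's actual API—just a toy router, under the assumption that deterministic tasks get handlers in plain code and only the remainder falls through to a model:

```python
def alphabetize(items):
    # Deterministic: solved by plain code, no model call needed.
    return sorted(items, key=str.lower)

def route(task, payload, llm=None):
    """Toy router in the spirit of the framework's philosophy:
    send only genuinely language-dependent work to the model."""
    handlers = {"alphabetize": alphabetize}  # hypothetical registry
    if task in handlers:
        return handlers[task](payload)       # cheap, exact, testable
    if llm is None:
        raise ValueError(f"no deterministic handler for task: {task}")
    return llm(payload)                      # fall through to the LLM

print(route("alphabetize", ["banana", "Apple", "cherry"]))
```

Every task the registry catches is one inference you never pay for—and one output that's exact instead of probabilistic.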

They’re also working on LoRA adapters they call “intrinsics”—small adapters that run constraint checks by reusing the KV cache from prior generation. You generate a response, then verify it against programmatic requirements without a full new inference pass. Uncertainty quantification layered on top lets you set confidence thresholds for iterative refinement.
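The generate-verify-refine loop those intrinsics enable looks roughly like this. A minimal sketch, assuming a programmatic `check` and a scalar `confidence` score—the actual intrinsics run as LoRA adapters reusing the KV cache, which this toy loop doesn't model:

```python
def generate_until_valid(generate, check, confidence, threshold=0.9, max_tries=5):
    """Generate, verify against programmatic constraints, and retry
    until a draft passes with sufficient confidence."""
    for _ in range(max_tries):
        draft = generate()
        if check(draft) and confidence(draft) >= threshold:
            return draft
    return None  # caller escalates (e.g. to a bigger model or a human)

# Toy demo: "generation" improves on each attempt until the checks pass.
attempts = iter(["drft", "draft ok", "final draft ok"])
result = generate_until_valid(
    generate=lambda: next(attempts),
    check=lambda d: "ok" in d,         # stand-in programmatic constraint
    confidence=lambda d: len(d) / 14,  # stand-in uncertainty score
)
print(result)
```

Note the asymmetry: generation is the expensive step, verification is cheap—and cheaper still when the verifier reuses the KV cache instead of re-reading the whole context.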

Insight — The industry overcorrected. We went from explicit software engineering to “word vomit at an LLM and hope it works.” Kate’s argument: we need to bring back rigor about when language models are actually necessary. The systems that win will be hybrids—explicit code handling deterministic logic, models handling genuine language understanding, with intelligent routing between them.


4. Safety Through Process, Not Promises

The Event — IBM and Anthropic are the only two model developers with ISO 42001 certification. Kate walked through IBM’s approach: rigorous data provenance through IBM Legal, external red teaming, white hat hacker programs with bug bounties feeding directly back to training teams.

Why small models change the safety calculus — The edge deployment model sidesteps entire categories of risk. Data never leaves the system it operates in. You can bring the model to regulated data instead of sending regulated data to a model. For financial institutions (IBM’s core customer base), this isn’t a nice-to-have—it’s the difference between “legally deployable” and “interesting POC.”

Kate’s framing on governance: when you hit a hosted endpoint, you’re at the mercy of terms and conditions that can change. When you own the weights and run locally, you control the entire stack. The open/closed distinction matters less than operational control.

Insight — “No one ever got fired for buying IBM” isn’t just brand legacy. For regulated industries, the compliance story of small local models may matter more than capability deltas with frontier systems.


5. The Multimodal Strategy: Modular, Not Monolithic

The Event — IBM’s multimodal approach differs from the industry standard. Instead of training unified vision-language models, they’re building modular architectures: frozen LLM base with swappable adapters for different modalities. Speech adapter, vision adapter, document understanding adapter—all composable on a single model.

Why this matters for efficiency — You host one model. You assemble the right adapters at inference time based on the modality you need. No separate speech model, vision model, text model consuming separate resources. The granite-docling model (sub-300M parameters) converts PDFs to structured DocTag format and runs on CPUs for high-throughput document processing.
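The frozen-base-plus-adapters pattern can be sketched as a registry keyed by modality. The class names and encoders below are hypothetical illustrations of the composition idea, not IBM's implementation:

```python
class FrozenBase:
    """Stand-in for the single hosted LLM whose weights never change."""
    def forward(self, features: str) -> str:
        return f"base({features})"

class ModularModel:
    """One frozen base; swappable modality adapters chosen at inference time."""
    def __init__(self, base: FrozenBase):
        self.base = base
        self.adapters = {}

    def register(self, modality: str, encode) -> None:
        # e.g. speech, vision, document-understanding adapters
        self.adapters[modality] = encode

    def run(self, modality: str, raw_input: str) -> str:
        encode = self.adapters[modality]        # pick adapter per request
        return self.base.forward(encode(raw_input))

m = ModularModel(FrozenBase())
m.register("vision", lambda img: f"vis:{img}")
m.register("speech", lambda wav: f"aud:{wav}")
print(m.run("vision", "page.png"))  # prints: base(vis:page.png)
```

The efficiency story falls out of the structure: the base is hosted once, and adding a modality costs one small adapter, not another full model's worth of serving resources.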

What they explicitly don’t do: image generation, video generation, realistic image understanding. IBM’s position is that content generation carries risk their conservative enterprise customers don’t need, and they’d rather partner than build.

Insight — The industry assumption is that multimodal capability requires multimodal training. IBM is betting that modular composition on frozen LLMs gets you most of the value at a fraction of the compute and risk.


Links & Resources

Mellea (Generative Computing Framework)

Docling (Document Parsing & Conversion Toolkit)

Granite Models (IBM’s Small-Model Suite)

Hazy Research (Stanford — Intelligence-Per-Watt Work)

  • Lab Homepage: https://hazyresearch.stanford.edu/


The stream touched on federated learning opportunities, post-training vs pre-training knowledge injection, and latent space exploration. Those threads deserve their own deep dive—but the core thesis is clear: IBM is betting that the efficiency frontier keeps moving, and the winners will be those who architected for cheap abundant inference from the start.
