So the only content online I’ve ever actually paid for knowingly is the Wall Street Journal - got my digital subscription in 1999 - and your newsletter. I love how you deep dive without the hype but with more than a little snark. Ha ha!
What a nice compliment. Thank you so much. Will keep delivering on this to give you the best results.
Thank you for writing this. This is going to take me a few days to digest and fully understand. But I haven’t come across a more accessible breakdown of costs. I’m bookmarking this.
I’m glad you liked it.
Share it with some people if you think they might appreciate it!
As someone who can barely keep up with the terminology I think this has been the most interesting read so far for me on Substack and deserving of my first comment.
Keep up the good work and if it brings a smile to your face Claude got a lot of "what does x mean" prompts thanks to you.
I’m glad this was interesting. Many more to come
Fantastic article, Devansh. I'm going to have to reread sections of it multiple times to help details sink in. But two things stood out to me as questionable.
"The key concept to understand here is arithmetic intensity — the ratio of compute operations to bytes of memory moved. It tells you whether your hardware is spending its time doing useful math or sitting idle waiting for data."
The assumption that busy GPUs are doing "useful math" is completely unfounded. That might be true during a session, but the prefill at the start of a new session is definitely wasting wall clock time and electricity. Many later tokens likely are, too. At least some should be coming from a shared cache rather than being recomputed all the time.
It might be the equivalent of paying workers or instructing prisoners to move a pile of rocks from point A to point B and then having them move everything back to point A afterward. In a scientific sense, work is being done because matter is moving. But it isn't useful work.
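For concreteness, the ratio the quoted passage describes can be sketched in a few lines. The token counts and dimensions below are illustrative assumptions, not measurements; the point is only why prefill and decode land on opposite sides of the compute/memory divide.

```python
# Back-of-envelope arithmetic intensity (FLOPs per byte of memory traffic)
# for a single d x d weight matrix in fp16. Illustrative numbers only.

def arithmetic_intensity(flops, bytes_moved):
    """Compute operations per byte of data moved, as defined in the article."""
    return flops / bytes_moved

n, d = 2048, 4096                    # assumed prompt length and model width

# Prefill: one matmul covers all n prompt tokens, so each weight read from
# memory is reused n times -> high intensity, compute-bound.
prefill_flops = 2 * n * d * d        # multiply-accumulate over n tokens
prefill_bytes = 2 * d * d            # fp16 weights read once
print(arithmetic_intensity(prefill_flops, prefill_bytes))   # 2048.0 (= n)

# Decode: one new token per step reads the same full weight matrix,
# so intensity collapses -> memory-bound, GPU mostly waits on HBM.
decode_flops = 2 * 1 * d * d
decode_bytes = 2 * d * d
print(arithmetic_intensity(decode_flops, decode_bytes))     # 1.0
```

The intensity ratio works out to roughly the number of tokens processed per weight read, which is why batching decode requests is the standard way to claw back utilization.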
"Each concurrent user needs their own KV cache. You can’t share contexts between people having different conversations (although that would be pretty fun to see)."
This is also an assumption that might well be false. We've seen recently that restating a prompt improves the accuracy of non-reasoning models. Applications where multiple users or multiple agents are looking for the same content would definitely benefit from shared, cached context. For example, a whole lot of people have done web searches about Bad Bunny recently. Or the Olympics. Or "best sushi near me." If operators can steer sessions toward GPUs with pre-warmed context that is likely to be relevant, they can deliver answers and value faster. They have no incentive to do so as long as they're billing based on token counts, but the SaaSpocalypse has shown us the need to charge for value delivered rather than tokens consumed.
Also re incentives, I'd push back there. Compute is the bottleneck right now. By caching and sharing contexts, API providers skip the compute-heavy prefill phase entirely, serving more users on the same hardware, thus maximizing margin. This is why Anthropic and OpenAI explicitly offer and discount Prompt Caching.
Great point. Neoclouds are routinely winning or losing deals because of access to hardware that is ready to sell. Anthropic and OpenAI are facing the same issue. They have a strong incentive that I wasn't considering to make less compute-intensive offerings available so that they can grow and retain market share.
These are great points.
1. We're calling it "useful" because it's work that has to be done to fulfill a task. The article is a mechanics model of transformer inference: what the hardware is doing and what it costs. Whether any given computation is economically useful is a separate question that sits one layer up. That said, the assumption isn't unreasonable: if the model has been asked to do a task, the prefill serving that task is at minimum instrumentally useful.
2. Cache sharing is very real and we even use it at Iqidis — system prompts get cached, and within a firm you can cache shared matter/process instructions and context at different abstraction layers, which meaningfully cuts redundant compute. Works well when the prefix is controlled. At OpenAI's scale though, personal memory, session history, language, application context — all of that makes finding a shared prefix much harder. Not impossible, but the routing to something semantically close enough becomes its own problem.
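The controlled-prefix case described above can be sketched as a simple prefix-keyed cache. Everything here is hypothetical (the class name, the hashing granularity, the fake prefill function); real serving stacks cache at block level inside the KV store, but the economics are the same: prefill runs once per shared prefix instead of once per request.

```python
# Hypothetical sketch of prefix-based KV-cache sharing: requests whose
# prompts begin with an identical token prefix (e.g. a shared system
# prompt) reuse one cached entry instead of re-running prefill.
import hashlib

class PrefixKVCache:
    def __init__(self):
        self._store = {}                          # prefix hash -> cached KV state

    def _key(self, prefix_tokens):
        raw = " ".join(map(str, prefix_tokens)).encode()
        return hashlib.sha256(raw).hexdigest()

    def get_or_prefill(self, prefix_tokens, prefill_fn):
        key = self._key(prefix_tokens)
        if key not in self._store:                # miss: pay for prefill once
            self._store[key] = prefill_fn(prefix_tokens)
        return self._store[key]                   # later callers skip prefill

calls = []
cache = PrefixKVCache()
system_prompt = [101, 7, 7, 42]                   # token ids for a shared prefix

def fake_prefill(tokens):
    calls.append(1)                               # count how often prefill runs
    return f"kv({len(tokens)})"

for _ in range(3):                                # three users, same system prompt
    kv = cache.get_or_prefill(system_prompt, fake_prefill)

print(len(calls))   # 1 -> prefill ran once; two requests hit the cache
```

This only works when the prefix matches exactly, which is precisely the point above: per-user memory and session history break the exact match, and "semantically close enough" routing is a much harder problem.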
Final thing: restating prompts improves accuracy for logit-guiding reasons, not KV-cache reasons. The mechanics of that specific effect (how it shapes logit behavior) were shared here:
https://www.artificialintelligencemadesimple.com/p/prompting-how-to-become-more-effective?utm_source=publication-search
I definitely think routing and load balancing are areas where providers and enterprises have opportunities to drive major efficiency improvements in their AI stacks. At the very least, everyone is looking for ways to ratchet up GPU over-subscription. Making better choices about how to group or distribute sessions is a hot topic.
Such a wonderful piece! Thanks for sharing your thoughts on the economics of running AI 🫡
Thank you mate
Totally wonderful. Thank you 🙏
Very happy you enjoyed this
Thank you so much Devansh! I had been looking for or trying to figure this out for myself. Great contribution!
I'm glad you liked it
Solid breakdown of the transformer cost formula. The 24nd² + 4n²d per-layer formula makes it clear why long-context inference burns so much compute.
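A quick check on those two terms (taking n as sequence length and d as model dimension, per the article) shows where the crossover sits. The width below is an assumed example value, not from the article.

```python
# Per-layer FLOP estimate from the article: 24*n*d^2 (the matmuls)
# plus 4*n^2*d (attention scores), n = sequence length, d = model dim.
def flops_per_layer(n, d):
    return 24 * n * d**2 + 4 * n**2 * d

d = 4096                                   # assumed model width
for n in (1_000, 32_000, 128_000):
    total = flops_per_layer(n, d)
    attn_share = 4 * n**2 * d / total      # fraction from the quadratic term
    print(f"n={n:>7}: {total:.2e} FLOPs/layer, attention share = {attn_share:.0%}")

# Setting 4*n^2*d = 24*n*d^2 gives n = 6d: past 24,576 tokens (for d=4096)
# the quadratic attention term dominates the linear matmul term.
```

That n = 6d crossover is the whole long-context story in one line: below it the matmuls dominate and cost grows roughly linearly; above it every extra token gets more expensive than the last.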
I've been looking at this from a different angle—not just the raw costs, but how providers actually price their services. The gap between what inference truly costs and what some platforms charge is pretty striking.
I ended up writing about the recent price hikes from Chutes, Z.ai, and Synthetic—why the providers raising prices might actually be the honest ones: https://sulat.com/p/the-real-cost-of-cheap-ai-inference
Bottom line: when you see $3/month for 'unlimited' frontier model access, someone's eating the difference.