You see the problem clearly. Benchmarks are gamed. Tokenmaxxing is a lie. Popularity metrics are manufactured. Deployment partnerships are expensive and not scalable. You describe every symptom with precision. But you cannot see the solution. You are doing triple somersaults to measure input — tokens, benchmarks, virality, FDEs — when the answer is simple. Measure output. Solved problems. Julies. (I have written on this extensively in my own Medium). Measure Brainpower. 1 Julie = 200 joules. Reproducible. Verifiable. Comparable. The industry is breaking its back trying to infer value from everything except value itself. Stop measuring the flour. Start measuring the bread. The Julie is not complicated. It is just honest.
Good points all. As with any new tech, the industry needs to be flexible and realistic in developing benchmarks. Value, useful output (call it intelligence delivered) is becoming the new and real measure. In 12-18 months the real measure might be something new.
Devansh, your assertion that “intelligence can’t be scored, it can only be audited” cuts straight through the noise of the current AI infrastructure crisis. The benchmark era is dead, and the market is finally hitting a wall of economic and operational reality.
Your breakdown of the super-linear cost curve and why enterprises are ripping out licenses is the exact diagnosis the industry refuses to admit. Frontier labs are forcing internal token maxing for data collection, meaning they build and ship agentic systems optimized for brute force—running unconstrained loops, forgetting state context, and bleeding compute. When an enterprise deploys these agents, they inherit a financial liability that burns through budgets in months.
If the new benchmark is “Who owns the accounting for intelligence?”, the answer cannot be found in a software deployment layer or an on-site consultant. It must be handled at the execution substrate.
This is why I engineered Veritas Core. We provide the physical architecture for the auditing of intelligence.
Instead of letting an agentic workflow run unconstrained loops that rack up thousands of dollars in wasted token spend before failing, Veritas Core moves governance below the operating system to the bare-metal PCIe layer. By using physical TPM 2.0 hardware circuit breakers, our architecture treats model execution as default-deny. Every single AI payload must provide a mathematically bound, offline-verifiable receipt of compliance and structural boundaries before it is physically allowed to cross the bus at T=0.
If the agent deviates, hallucinates, or falls into a super-linear cost rabbit hole, the hardware mechanically drops the payload instantly. It stops the financial and operational bleeding in real time and immediately writes an immutable, spacetime-anchored Merkle-DAG receipt detailing the exact operational failure.
You cannot own the accounting for intelligence if you are relying on probabilistic software to track itself. True accounting requires a hardware-anchored truth substrate. Another masterclass article, Dev. Keep exposing the math. As soon as I get my LinkedIn account back up and running, I'm definitely going to write an article about you, because you're not doing this for yourself. You're doing it for everyone. Thank you.
I think we agree! The only real difference is that I consider tokenmaxxing to be the silly directive to use as many tokens as possible because "more tokens means you're more productive", not the instruction to use AI for whatever tasks you can. The latter is necessary to understand agent usefulness (as you've argued here), and the former actually inhibits a company's ability to do that.
The line that lands hardest here: intelligence can't be scored, it can only be audited. But auditing needs a primitive, and that's the gap the piece leaves open. Deployment partnerships verify economic outcome, not capability, so you still can't separate a model that earned its margin from one that got lucky on a single workflow. In my eval research the only thing that resists Goodharting is decomposing each output into atomic claims you grade independently, since every claim is separately falsifiable and the combinatorial scoreboard stops collapsing into one gameable scalar. Curious whether you think "owning the accounting for intelligence" becomes a partnership moat or an open verification standard, because those lead very different places.
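A minimal sketch of the claim-level grading described in the comment above, under loud assumptions: the names `split_into_claims`, `grade_output`, and `failed_claims` are hypothetical, splitting on sentence boundaries is only a naive stand-in for real claim extraction, and the `verify` callback is whatever checker you trust (a human, a lookup, a test). It is not the commenter's actual method; it only illustrates keeping one verdict per claim instead of one scalar per output.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class ClaimGrade:
    """Verdict for a single atomic claim (hypothetical structure)."""
    claim: str
    supported: bool


def split_into_claims(output: str) -> List[str]:
    # Naive stand-in: treat each period-terminated sentence as one claim.
    # A real pipeline would need a much more careful decomposition.
    return [part.strip() for part in output.split(".") if part.strip()]


def grade_output(output: str, verify: Callable[[str], bool]) -> List[ClaimGrade]:
    # Grade every claim independently; no aggregation into a single score.
    return [ClaimGrade(claim, verify(claim)) for claim in split_into_claims(output)]


def failed_claims(grades: List[ClaimGrade]) -> List[str]:
    # The audit trail: exactly which claims did not hold up.
    return [grade.claim for grade in grades if not grade.supported]


if __name__ == "__main__":
    # Toy verifier: a claim counts as supported only if it is in a known set.
    known_facts = {"Paris is in France", "Water freezes at 0 C"}
    answer = "Paris is in France. The Moon is made of cheese. Water freezes at 0 C."
    grades = grade_output(answer, lambda claim: claim in known_facts)
    print(failed_claims(grades))
```

The point of the design is that the result is a list of falsifiable verdicts rather than one number, so a single lucky workflow cannot hide which specific claims failed.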
Couldn’t agree more. Just wrote about the difference between having AI and obtaining value from it. https://procureandprosper.substack.com/p/we-have-ai-is-not-the-same-as-we?r=1vdkei&utm_campaign=post&utm_medium=web
Sir Dev, thank you for the enjoyable read and learn.
This mind exercise (to understand) builds muscle while learning about new business constructs (and old business games repurposed).
Eye opening, cheers!