Really interesting piece Devansh!
Thank you
Thank you for this excellent article. I agree. LLMs can generate language but thought is computational and requires discrete steps.
Yep
The compression versus procedural distinction you're drawing here cuts deeper than most people realize. What makes reasoning architectural rather than parametric is exactly what you're describing: the need for persistent state, conditional branching, and explicit backtracking. Training teaches pattern completion, but reasoning requires control flow that operates outside of token-by-token generation. The evolutionary valley analogy is especially apt—optimization pressure naturally selects against carrying forward the very uncertainty that enables genuine exploration. When you collapse the search tree into a single path during training, you're essentially teaching the model to approximate outcomes without learning the search algorithm. It's the difference between memorizing chess positions and understanding how to evaluate them dynamically. This is why infrastructure-based reasoning isn't just more efficient, it's categorically different.
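To make that concrete, here is a minimal sketch, assuming hypothetical `propose_steps`, `score`, and `is_solved` callables standing in for model calls or heuristics; it isn't the article's implementation, just an illustration of state, branching, and backtracking living in an external loop rather than in the weights.

```python
# Minimal sketch (assumed, not from the article): search as infrastructure.
# `propose_steps`, `score`, and `is_solved` are hypothetical stand-ins for
# model calls or heuristics supplied by the caller.

def external_search(problem, propose_steps, score, is_solved, max_nodes=200):
    frontier = [(score(problem, []), [])]   # persistent state: scored partial paths
    explored = 0
    while frontier and explored < max_nodes:
        frontier.sort(key=lambda t: t[0], reverse=True)
        _, path = frontier.pop(0)           # conditional branching: expand the best branch
        explored += 1
        if is_solved(problem, path):
            return path                     # explicit success check, outside generation
        for step in propose_steps(problem, path):
            new_path = path + [step]
            frontier.append((score(problem, new_path), new_path))
        # Weak branches stay in the frontier; returning to a different prefix on a
        # later iteration is the backtracking a single forward pass cannot do.
    return None                             # budget exhausted without a solution
```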
yep
Very nice.
This approach seems to somewhat mirror what I consider one of the best-practice paths for n-order thinking in humans (a rough code sketch follows the list):
1. Expand (Mutate): Consider many possibilities for first-order choices/effects
2. Prune (Score, Select): Evaluate the possibilities and cut off weak branches
- The branches you cut are often the low-probability or low-consequence ones (or both), depending on whether your goal is forecasting or risk mitigation.
- But it can be anything. Your idea of multiple judges applies here; it's just hard for humans to accurately evaluate more than one or two criteria at a time.
3. Branch (Repeat): For each possibility that's left, go back to step 1 for that branch (n+1 order choices/effects)
4. After you've reached your target or the number of options is overwhelming, choose the most likely or most consequential surviving possibilities (depending on your goal), and make judgments about which need attention.
- Note: The target can be a particular depth, a level of uncertainty (the probability estimates are too fuzzy), a clear winner (one branch clearly outweighs the others), etc.
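Here is the sketch of that expand-prune-branch loop, assuming hypothetical `expand` and `judges` callables (model calls or human estimates); the depth, keep count, and "clear winner" margin are placeholders, not recommendations.

```python
# Rough sketch of the expand -> prune -> branch loop above (assumptions:
# `expand` and `judges` are hypothetical callables; numbers are placeholders).

def n_order_think(seed, expand, judges, max_depth=4, keep=3):
    branches = [[seed]]                                   # each branch is a path of effects
    for _ in range(max_depth):                            # step 3: repeat for n+1 order
        candidates = []
        for path in branches:
            for effect in expand(path):                   # step 1: expand / mutate
                candidates.append(path + [effect])
        if not candidates:
            break
        # Step 2: prune -- average the judges' scores and keep only the strongest.
        scored = sorted(
            ((sum(j(p) for j in judges) / len(judges), p) for p in candidates),
            key=lambda t: t[0],
            reverse=True,
        )
        branches = [p for _, p in scored[:keep]]
        # Step 4 / note: stop early on a clear winner (placeholder margin of 0.2).
        if len(scored) > 1 and scored[0][0] - scored[1][0] > 0.2:
            break
    return branches                                       # surviving possibilities to judge
```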
In humans, without a good pruning step, the exponential nature of the process makes the reasoning overwhelming after only 2-3 levels. Even with good pruning, most humans have a hard time going past 4th-order thinking.
Machine reasoning can go deeper without getting overwhelmed. But as you said, why waste resources following weak paths? And the ability to trace the reasoning path and identify why some paths were pruned or followed is crucial; machine reasoning leaves a documentation trail that's extremely valuable in so many contexts.
I'm curious what conditions you use to decide the reasoning chain is finished. Do you have a separate overall evaluation (like step 4 in my process - is that your step 7?), or do you use the same scoring mechanism you use when pruning branches in some way?
You have two sets of conditions:
- Compute constraints: hard stopping criteria (a maximum number of chains, a time budget, etc.).
- Judges that evaluate the solution and decide whether it's good enough.
Combining both is a good approach.
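A minimal sketch of what combining the two might look like, assuming hypothetical `judges` callables; the caps and the 0.8 threshold are illustrative placeholders, not values from the article.

```python
# Sketch of combining both stopping conditions (assumed names and thresholds).

def should_stop(chains_run, elapsed_seconds, best_solution, judges,
                max_chains=16, max_seconds=120, good_enough=0.8):
    # Compute constraints: hard caps on the number of chains and wall-clock time.
    if chains_run >= max_chains or elapsed_seconds >= max_seconds:
        return True
    # Judges: stop early once the current best solution scores well enough.
    if best_solution is not None and judges:
        avg = sum(judge(best_solution) for judge in judges) / len(judges)
        if avg >= good_enough:
            return True
    return False
```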
I agree with the core point: forcing search, evaluation, and control flow into weights creates brittleness and hides variance. Externalized loops and explicit state clearly unlock things that “thinking harder” inside a single pass never will.
Where I see it differently is closer to the end user. Today, most end users don’t actually know when they’re invoking reasoning at all. The router decides. The same prompt might return in seconds or minutes, with very different depth and cost, and the user has no visibility into which mode they’re in or why. It’s like handing someone a RED 5K camera in full auto. The footage looks incredible, but they’re still thinking like they’re shooting on a phone. Or using an SLR as a point-and-shoot and assuming the quality just “happened.”
In practice, we’re already in a hybrid world. Stronger in-model reasoning improves baseline usefulness and keeps everyday interactions fast. External reasoning infrastructure earns its keep when exploration, verification, or backtracking actually matter. The most common failure mode I see isn’t where reasoning lives. It’s unintentional depth.
Reasoning has a place. Like the rest of AI, it isn’t a magic “does everything” button. But it’s also too early to declare time of death in the court of public opinion while end users are still learning to crawl and walk.