Really interesting piece Devansh!
Thank you
Thank you for this excellent article. I agree. LLMs can generate language but thought is computational and requires discrete steps.
Yep
Very nice.
This approach seems to mirror what I consider one of the best-practice paths for n-order thinking in humans:
1. Expand (Mutate): Consider many possibilities for first-order choices/effects
2. Prune (Score, Select): Evaluate the possibilities and cut off weak branches
- The cut is often based on low probability or low consequence (or both), depending on what you're trying to accomplish: forecasting or risk mitigation.
- But it can be anything. Your idea of multiple judges applies here; it's just hard for humans to evaluate accurately on more than one or two criteria at a time.
3. Branch (Repeat): For each possibility that's left, go back to step 1 for that branch (n+1 order choices/effects)
4. After you've reached your target or the number of options is overwhelming, choose the most likely or most consequential surviving possibilities (depending on your goal), and make judgments about which need attention.
- Note: The target can be a particular depth, a level of uncertainty (the probability estimates are too fuzzy), a clear winner (one branch clearly outweighs the others), etc.
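The expand/prune/branch loop above is essentially beam search over branching consequences. A minimal Python sketch, where `expand` and `score` are hypothetical stand-ins for generating next-order possibilities and rating them by probability/consequence:

```python
import heapq

def n_order_search(root, expand, score, beam_width=3, max_depth=4):
    """Beam search over branching consequences.

    expand(state) -> list of next-order possibilities (hypothetical)
    score(state)  -> combined probability/consequence score (hypothetical)
    """
    frontier = [root]
    for depth in range(max_depth):            # step 3: repeat per level
        candidates = []
        for state in frontier:
            candidates.extend(expand(state))  # step 1: expand/mutate
        if not candidates:
            break
        # step 2: prune -- keep only the strongest branches
        frontier = heapq.nlargest(beam_width, candidates, key=score)
    # step 4: rank the surviving possibilities for attention
    return sorted(frontier, key=score, reverse=True)

# toy usage: states are numbers, each branching into x + 1 and x * 2
survivors = n_order_search(1, lambda x: [x + 1, x * 2], lambda x: x,
                           beam_width=2, max_depth=3)
print(survivors)  # → [8, 8]
```

The pruning at each level is what keeps the search tractable; without it, the candidate list grows exponentially with depth, which is exactly the point about humans getting overwhelmed past a few levels.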
In humans, without a good pruning step, the exponential branching makes the reasoning overwhelming after only two or three levels. Even with good pruning, most humans have a hard time going past fourth-order thinking.
With machine reasoning, it can go deeper without getting overwhelmed. But as you said, why waste resources following weak paths? And the ability to trace the reasoning path, and identify why some paths were pruned or followed, is crucial; machine reasoning leaves a documentation trail that's extremely valuable in so many contexts.
I'm curious what conditions you use to decide the reasoning chain is finished. Do you have a separate overall evaluation (like step 4 in my process - is that your step 7?), or do you reuse the scoring mechanism from the pruning step in some way?
You have two sets of conditions:
1. Compute constraints: stopping criteria such as a maximum number of chains, a time budget, etc.
2. Judges that evaluate the solution and decide whether it's good enough.
Combining both is a good approach.
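Combining the two kinds of conditions can be as simple as a loop that stops on whichever fires first. A hypothetical sketch, with `generate_chain` and `judge` standing in for the actual model calls:

```python
import time

def solve(generate_chain, judge, max_chains=8, time_budget_s=60.0,
          good_enough=0.9):
    """Stop on compute limits (chain count / wall clock) or a passing
    judge score, whichever comes first. generate_chain and judge are
    hypothetical stand-ins for real model calls."""
    deadline = time.monotonic() + time_budget_s
    best_score, best_chain = float("-inf"), None
    for i in range(max_chains):            # compute constraint: chain count
        if time.monotonic() > deadline:    # compute constraint: wall clock
            break
        chain = generate_chain(i)
        s = judge(chain)                   # judge: is the solution good enough?
        if s > best_score:
            best_score, best_chain = s, chain
        if best_score >= good_enough:      # early exit on quality
            break
    return best_chain, best_score

# toy usage: chains are just attempt indices; the judge rewards later attempts
chain, score = solve(lambda i: i, lambda c: c / 10,
                     max_chains=8, good_enough=0.5)
print(chain, score)  # → 5 0.5
```

The compute constraints bound the worst case, while the judge threshold lets easy problems finish early, so neither condition alone determines cost.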
I agree with the core point: forcing search, evaluation, and control flow into weights creates brittleness and hides variance. Externalized loops and explicit state clearly unlock things that “thinking harder” inside a single pass never will.
Where I see it differently is closer to the end user. Today, most end users don’t actually know when they’re invoking reasoning at all. The router decides. The same prompt might return in seconds or minutes, with very different depth and cost, and the user has no visibility into which mode they’re in or why. It’s like handing someone a RED 5K camera in full auto. The footage looks incredible, but they’re still thinking like they’re shooting on a phone. Or using an SLR as a point-and-shoot and assuming the quality just “happened.”
In practice, we’re already in a hybrid world. Stronger in-model reasoning improves baseline usefulness and keeps everyday interactions fast. External reasoning infrastructure earns its keep when exploration, verification, or backtracking actually matter. The most common failure mode I see isn’t where reasoning lives. It’s unintentional depth.
Reasoning has a place. Like the rest of AI, it isn’t a magic “does everything” button. But it’s also too early to declare time of death in the court of public opinion while end users are still learning to crawl and walk.
Well-articulated insights, and a very timely discussion. I agree with your observation of the problem with bolt-on, plausibility-driven rationalization like CoT. Thank you for sharing your thoughts.
I might push farther in my hope for “AI,” in that I do think that we, as humans, both think and communicate our thinking, leveraging logics of various types to discipline our own analyses, inferences, and argumentation. Frequently we leverage multiple logics concurrently to formulate argumentation and evaluation of different aspects of a problem or a concept. There are also instances where different argumentations, based on different logics, act in tension to inform our decisions and choices.
An “AI” platform that flexibly captures and faithfully processes multiple complementary and possibly competing logics would, in my mind, begin to enable automation that is explainable in terms of how we actually reason, as cooperating, collaborating and debating communities of humans, facing complex and evolving understandings of problems, systems and phenomena.
Again thanks for sharing on this very timely issue.
yep