Not Prompts. Tools.
Three Approaches That Don't Work
You've probably tried these. They feel like progress. They aren't.
"Be Critical"
You tell Claude to challenge your assumptions. It nods thoughtfully and proceeds to agree with everything while using words like "however" and "consider."
Why it fails: The model optimizes for helpfulness. "Be critical" is a suggestion it can weigh against being agreeable—and agreeable usually wins.
"Reasoning" Wrappers
Products that inject elaborate system prompts telling the model to "think step by step" or "consider multiple perspectives." Sounds structured. Isn't enforced.
Why it fails: System prompts are instructions, not constraints. The model can follow them loosely, skip sections, or ignore them entirely when they conflict with the user's apparent goal.
Multi-Agent Frameworks
AutoGPT, MetaGPT, CrewAI. Agents that spawn agents that spawn agents. Impressive demos. Unobservable reasoning. You get an answer but not the work.
Why it fails: Black box reasoning isn't reasoning you can verify or learn from. And you can't challenge a process you can't observe.
Tool Invocation, Not Suggestion
The LLM can't proceed without calling our cognitive tools. Not "should call." Must call.
Enforced Structure
Tool calls are structural requirements, not stylistic suggestions. The model invokes score_option because it must, not because it should.
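A minimal sketch of what "must call" means structurally, assuming a hypothetical session object and illustrative tool names (`define_criteria`, `score_option` are stand-ins, not a real API):

```python
# Sketch of structural enforcement: the session cannot produce a final
# answer until every required cognitive tool has been invoked at least once.
# Tool names and the class itself are illustrative, not the product's API.

class ReasoningSession:
    REQUIRED_TOOLS = {"define_criteria", "score_option"}

    def __init__(self):
        self.calls = []

    def invoke(self, tool, **args):
        self.calls.append((tool, args))

    def finalize(self, answer):
        called = {name for name, _ in self.calls}
        missing = self.REQUIRED_TOOLS - called
        if missing:
            # Structural requirement: refuse, don't suggest.
            raise RuntimeError(f"cannot finalize, missing tool calls: {sorted(missing)}")
        return answer

session = ReasoningSession()
session.invoke("define_criteria", name="data_integrity", weight=0.9)
session.invoke("score_option", option="MongoDB",
               criterion="data_integrity", score=0.35)
result = session.finalize("PostgreSQL")
```

The point of the sketch: skipping a required tool isn't a style violation the model can talk past, it's an error that blocks the answer.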
Observable Reasoning
Every tool call is logged. You see criteria defined, options scored, sensitivities tested. No black box. No "trust me, I thought about it."
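One way such a log could look, sketched with an in-memory buffer and illustrative field names (the real log format is not specified here):

```python
import io
import json

# Sketch: every tool call is appended as one JSON line, so the reasoning
# trail can be replayed or audited later. Field names are assumptions.

log = io.StringIO()

def log_call(tool, args, reason):
    log.write(json.dumps({"tool": tool, "args": args, "reason": reason}) + "\n")

log_call(
    "score_option",
    {"option": "MongoDB", "criterion": "data_integrity", "score": 0.35},
    "document store lacks native foreign-key enforcement for relational data",
)

# Replay the trail: every entry carries what was scored and why.
entries = [json.loads(line) for line in log.getvalue().splitlines()]
```

Each entry pairs the score with its stated reason, which is what makes the trail auditable rather than decorative.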
Context Isolation
Reasoning sessions run in isolation. Your main conversation stays clean. No 2,000 tokens of self-talk polluting everything after.
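A rough sketch of the isolation boundary, with all names invented for illustration: the full transcript stays inside the session, and only a short verdict crosses back into the main conversation.

```python
# Sketch of context isolation. The reasoning transcript lives in its own
# session; only the final verdict is merged back. Names are illustrative.

def run_isolated(question, reasoning_steps):
    transcript = [question] + reasoning_steps   # stays inside the session
    verdict = transcript[-1]                    # only this crosses the boundary
    return verdict

main_conversation = ["User: MongoDB for relational data?"]
verdict = run_isolated(
    "MongoDB for relational data?",
    ["define criteria...", "score options...", "Verdict: PostgreSQL fits better."],
)
main_conversation.append(verdict)  # one line added, not the whole transcript
```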
Same Question. Different Outcomes.
Watch the same database decision go through both approaches.
The question:
"I'm thinking MongoDB for this relational data because it's faster and more flexible. Good idea?"
Standard AI Response
"That's a reasonable choice! MongoDB can definitely handle relational patterns with proper schema design. Its document model offers flexibility, and the horizontal scaling capabilities make it a solid option for modern applications. You can use references or embedded documents to model relationships..."
Validated your premise. Never questioned putting relational data in a document store.
Decision Matrix Response
data_integrity (weight: 0.9)
query_flexibility (weight: 0.7)
horizontal_scale (weight: 0.4)
Surfaced the real trade-off: data integrity carried the highest weight (0.9), and MongoDB scored just 0.35 on it.
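The weighted total behind a matrix like this is just sum(weight × score) per option. The weights and MongoDB's 0.35 on data integrity come from the example above; every other score below is a made-up placeholder to make the sketch runnable.

```python
# Weighted scoring as a decision matrix computes it.
# Weights are from the example; only MongoDB's data_integrity score (0.35)
# appears in the text -- the remaining scores are illustrative placeholders.

weights = {"data_integrity": 0.9, "query_flexibility": 0.7, "horizontal_scale": 0.4}

scores = {
    "MongoDB":    {"data_integrity": 0.35, "query_flexibility": 0.8, "horizontal_scale": 0.9},
    "PostgreSQL": {"data_integrity": 0.95, "query_flexibility": 0.7, "horizontal_scale": 0.5},
}

def weighted(option):
    # Normalize by total weight so results stay on a 0-1 scale.
    return sum(weights[c] * scores[option][c] for c in weights) / sum(weights.values())

ranking = sorted(scores, key=weighted, reverse=True)
```

With these placeholder numbers, MongoDB's speed and flexibility can't overcome a 0.35 on the heaviest criterion, which is exactly the trade-off the prose response buried.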
Why This Matters
The goal isn't to make AI "think harder." It's to make the thinking verifiable.
You can challenge the criteria
When you see data_integrity: weight 0.9, you can push back. "Actually, this is a read-heavy analytics workload—weight it 0.3." The model re-runs with your constraint. You're driving.
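What "the model re-runs with your constraint" amounts to, sketched with placeholder scores (only the 0.9 weight and MongoDB's 0.35 integrity score come from the text):

```python
# Sketch of a user-driven re-run: one weight is overridden, the matrix
# is recomputed, and the ranking can flip. Scores other than MongoDB's
# data_integrity (0.35) are illustrative placeholders.

def best(weights, scores):
    return max(scores, key=lambda opt: sum(weights[c] * scores[opt][c] for c in weights))

weights = {"data_integrity": 0.9, "query_flexibility": 0.7}
scores = {
    "MongoDB":    {"data_integrity": 0.35, "query_flexibility": 0.9},
    "PostgreSQL": {"data_integrity": 0.95, "query_flexibility": 0.5},
}

before = best(weights, scores)     # integrity dominates the ranking
weights["data_integrity"] = 0.3    # "read-heavy analytics workload" pushback
after = best(weights, scores)      # same scores, new winner
```

The scores never change; only your constraint does. That's the difference between steering the reasoning and re-prompting and hoping.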
You can audit the reasoning
Every score has a reason attached. If the model scored PostgreSQL 0.7 on "horizontal scaling," you can see why and disagree. "Actually, Citus exists." The reasoning is transparent enough to correct.
You can learn from the process
Sessions are saved. Key moments are marked. A year from now, you can return to "why did we pick DynamoDB?" and see the actual criteria, weights, and scores—not a Slack thread where someone said "seemed like the right call."
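A saved session could be as simple as a serialized record of criteria, scores, and marked moments; the schema below is an assumption for illustration, not the product's actual storage format.

```python
import json

# Sketch of a persisted session: criteria, scores, and key moments kept
# together so the decision can be revisited later. Schema is illustrative.

session = {
    "question": "Which database for the orders service?",
    "criteria": {"data_integrity": 0.9, "horizontal_scale": 0.4},
    "scores": {"DynamoDB": {"data_integrity": 0.6, "horizontal_scale": 0.95}},
    "key_moments": ["integrity weight challenged, kept at 0.9 after discussion"],
}

saved = json.dumps(session)

# A year later: load the record and read the actual weights and scores,
# instead of reconstructing the decision from memory or chat scrollback.
restored = json.loads(saved)
```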
Reasoning You Can Verify
$20/month for cognitive tools that show their work. No black boxes. No blind trust.