
The Difference

Not Prompts. Tools.

Most "reasoning" solutions inject a system prompt the model can cheerfully ignore. We inject structured tools that create auditable reasoning trails.

Three Approaches That Don't Work

You've probably tried these. They feel like progress. They aren't.

1. "Be Critical"

You tell Claude to challenge your assumptions. It nods thoughtfully and proceeds to agree with everything while using words like "however" and "consider."

System prompt:
"Be critical and challenge my assumptions"
Claude:
"That's a thoughtful approach! However, you might also consider..."

Why it fails: The model optimizes for helpfulness. "Be critical" is a suggestion it can weigh against being agreeable—and agreeable usually wins.

2. "Reasoning" Wrappers

Products that inject elaborate system prompts telling the model to "think step by step" or "consider multiple perspectives." Sounds structured. Isn't enforced.

Hidden system prompt:
"Before answering, think through multiple angles. Consider risks. Challenge assumptions..."
What actually happens:
Model skips to answer, retrofits "reasoning" after deciding

Why it fails: System prompts are instructions, not constraints. The model can follow them loosely, skip sections, or ignore them entirely when they conflict with the user's apparent goal.

3. Multi-Agent Frameworks

AutoGPT, MetaGPT, CrewAI. Agents that spawn agents that spawn agents. Impressive demos. Unobservable reasoning. You get an answer but not the work.

What you see:
"Thinking..." (30 seconds)
"Here's my recommendation: Option B"
What you don't see:
The actual reasoning, trade-offs considered, criteria used, scores assigned

Why it fails: Black-box reasoning isn't reasoning you can verify or learn from. And you can't challenge a process you can't observe.

Tool Invocation, Not Suggestion

The LLM can't proceed without calling our cognitive tools. Not "should call." Must call.

System Prompt Approach

# What "reasoning wrappers" do
system: "Before answering, evaluate options against criteria, consider trade-offs, challenge assumptions..."

# What the model does
assistant: "Great question! Based on your needs, I'd recommend MongoDB..."

Model skipped evaluation. You got a recommendation, not reasoning.
Tool Invocation Approach

# What we do
REQUIRED_TOOLS: [define_criteria, score_option, run_sensitivity]

# What the model must do
INVOKE: define_criteria(...)
INVOKE: score_option(...)
INVOKE: run_sensitivity(...)

Model can't skip. You see every call. Reasoning is observable.
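The gate behind that contrast can be sketched in a few lines. This is an illustrative sketch, not the product's actual API: the names `REQUIRED_TOOLS` and `answer_is_admissible` are assumptions, standing in for whatever mechanism rejects a final answer until every required tool has been invoked.

```python
# Sketch of enforced tool invocation: a final answer is rejected until
# every required cognitive tool appears in the logged call trail.
# (Hypothetical names; real tool-calling APIs differ in shape.)

REQUIRED_TOOLS = {"define_criteria", "score_option", "run_sensitivity"}

def answer_is_admissible(tool_calls):
    """Accept a final answer only once all required tools were invoked."""
    invoked = {call["name"] for call in tool_calls}
    return REQUIRED_TOOLS.issubset(invoked)

# The model tried to answer after scoring but never ran sensitivity:
calls = [{"name": "define_criteria"}, {"name": "score_option"}]
print(answer_is_admissible(calls))  # False: run_sensitivity is missing

calls.append({"name": "run_sensitivity"})
print(answer_is_admissible(calls))  # True: and every call is in the log
```

The point of the structure: "should call" is a preference the model can trade away; a subset check like this is a constraint it can't.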
🔒 Enforced Structure

Tool calls are structural requirements, not stylistic suggestions. The model invokes score_option because it must, not because it should.

🔬 Observable Reasoning

Every tool call is logged. You see criteria defined, options scored, sensitivities tested. No black box. No "trust me, I thought about it."

🧹 Context Isolation

Reasoning sessions run in isolation. Your main conversation stays clean. No 2,000 tokens of self-talk polluting everything that follows.
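The isolation pattern can be sketched as follows. Everything here is illustrative, not the product's real interface: `run_isolated_reasoning`, `toy_reasoner`, and the returned summary shape are assumptions showing the structure, in which the full reasoning trail stays inside the session and only a compact summary reaches the main conversation.

```python
# Sketch of context isolation: the reasoning trail lives in the session;
# the main conversation receives only a small summary.
# (Hypothetical names and return shape.)

def run_isolated_reasoning(question, reason_fn):
    session_log = []                          # full trail stays here
    verdict = reason_fn(question, session_log.append)
    return {"verdict": verdict, "steps": len(session_log)}

def toy_reasoner(question, log):
    log("define_criteria(...)")               # each call is logged in-session
    log("score_option(...)")
    log("run_sensitivity(...)")
    return "PostgreSQL"                       # toy verdict for illustration

main_context = []
main_context.append(
    run_isolated_reasoning("MongoDB for relational data?", toy_reasoner)
)
# main_context now holds one small summary, not the whole trail
```

The design choice: the session log is still there for auditing, but it never lands in the conversation history that every later message has to carry.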

Same Question. Different Outcomes.

Watch the same database decision go through both approaches.

The question:

"I'm thinking MongoDB for this relational data because it's faster and more flexible. Good idea?"

Standard AI Response

"That's a reasonable choice! MongoDB can definitely handle relational patterns with proper schema design. Its document model offers flexibility, and the horizontal scaling capabilities make it a solid option for modern applications. You can use references or embedded documents to model relationships..."

Validated your premise. Never questioned putting relational data in a document store.

Decision Matrix Response

define_criteria:
data_integrity (weight: 0.9)
query_flexibility (weight: 0.7)
horizontal_scale (weight: 0.4)
score_option("MongoDB", "data_integrity"):
score: 0.35
reason: "Relational data requires ACID guarantees and referential integrity MongoDB can't enforce natively"

Surfaced the real trade-off. Data integrity weighted 0.9 but scored 0.35.

Why This Matters

The goal isn't to make AI "think harder." It's to make the thinking verifiable.

1. You can challenge the criteria

When you see data_integrity: weight 0.9, you can push back. "Actually, this is a read-heavy analytics workload—weight it 0.3." The model re-runs with your constraint. You're driving.
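The re-run after a pushback is just the same weighted sum with one weight changed. In this sketch, only the reweighting of data_integrity from 0.9 to 0.3 comes from the paragraph above; the scores are hypothetical, and the weighted-average formula is an assumed, standard aggregation.

```python
# Sketch: re-running the matrix after the user reweights a criterion.
# Scores are hypothetical; only the 0.9 -> 0.3 reweighting is from the text.

def weighted_total(scores, weights):
    return sum(weights[c] * scores[c] for c in weights) / sum(weights.values())

scores = {"data_integrity": 0.35, "query_flexibility": 0.8, "horizontal_scale": 0.9}

original = {"data_integrity": 0.9, "query_flexibility": 0.7, "horizontal_scale": 0.4}
reweighted = dict(original, data_integrity=0.3)  # "read-heavy analytics" pushback

print(round(weighted_total(scores, original), 2))    # 0.62
print(round(weighted_total(scores, reweighted), 2))  # 0.73
```

One changed weight moves the verdict visibly, and because the weights are explicit, you can see exactly which assumption the new answer depends on.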

2. You can audit the reasoning

Every score has a reason attached. If the model scored PostgreSQL 0.7 on "horizontal scaling," you can see why and disagree. "Actually, Citus exists." The reasoning is transparent enough to correct.

3. You can learn from the process

Sessions are saved. Key moments are marked. A year from now, you can return to "why did we pick DynamoDB?" and see the actual criteria, weights, and scores—not a Slack thread where someone said "seemed like the right call."

Reasoning You Can Verify

$20/month for cognitive tools that show their work. No black boxes. No blind trust.