Choosing a Model / Debrief · part 02 / ≈ 8 minutes
04

Choosing a Model.

Now that you've seen prompting matter, here's how to think about which model — and which harness — to reach for.

In this section
  • The four-tier model spectrum
  • Beyond the base model · RAG, tools, agents
  • A decision tree for the next task

A model is one ingredient. The right model in the right harness, with the right prompt, is the actual unit of work.

§ 1 · The model spectrum

Four tiers,
one continuum.

Models trade off cost, latency, and depth of reasoning. Pick the lightest tier that still does the job — then move up only when the task needs it. The names below shift every six months; the shape of the spectrum doesn't.

Cost is per-token spend, relative across tiers for the same task. Latency is rough wall-clock time to a useful answer, not time to first token.

Tier 01 · Fast
Models · Claude Haiku · GPT-4o mini · Gemini Flash
Best for · High-volume, low-stakes shaping. Summaries, formatting, classification, autocomplete. Anything where you'd rather have ten passable answers than one perfect one.
Limits · Shallow reasoning. Loses the thread on long context. Will not recover from a vague prompt the way a frontier model can.
Cost · low · Latency · ms
Tier 02 · Workhorse
Models · Claude Sonnet 4.6 · GPT-4o · Gemini Pro
Best for · Most clinical-adjacent work that doesn't require novel reasoning. Multi-step audits, drafting, structured extraction, role-based prompts like the three-lens chart review.
Limits · Plateaus on genuinely hard or ambiguous cases. Confidently produces plausible-but-wrong answers more often than a reasoning model does.
Cost · moderate · Latency · seconds
Tier 03 · Frontier
Models · Claude Opus 4.5 · GPT-5 · Gemini Ultra
Best for · Cases where being right matters more than being fast. Subtle differential reasoning, long synthesis, novel writing, anything where you'd ask a senior colleague before acting.
Limits · Slower, more expensive per call. Still hallucinates citations. Confidence still outpaces calibration on the long tail.
Cost · high · Latency · seconds
Tier 04 · Reasoning
Models · Claude Opus 4.5 (extended) · OpenAI o3 / o4 · Gemini Deep Think
Best for · Hard, multi-step problems where you'd accept 30+ seconds for a better answer. Complex case reasoning, untangling conflicting evidence, math-shaped diagnostics.
Limits · Latency makes it the wrong tool for chat. Reasoning chains are not transcripts of clinical thought — don't read them as such.
Cost · highest · Latency · 30s+
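
To make "pick the lightest tier that still does the job" concrete, here is a minimal Python sketch of escalation: try the cheap model first and move up only when a check fails. call_llm() and the model-class names are illustrative assumptions, not any real SDK.

    from typing import Callable

    # Escalation sketch: start at the cheapest tier, move up only when a
    # validation check fails. Model-class names and call_llm() are
    # illustrative assumptions, not a real API.
    TIERS = [
        ("fast",      "haiku-class"),     # ms latency, high volume
        ("workhorse", "sonnet-class"),    # routine multi-step work
        ("frontier",  "opus-class"),      # correctness over speed
        ("reasoning", "opus-extended"),   # 30s+ deliberate chains
    ]

    def call_llm(model: str, prompt: str) -> str:
        return "(model answer)"  # stand-in for a real completion call

    def answer_with_escalation(prompt: str, looks_ok: Callable[[str], bool]) -> str:
        """Return the first answer that passes the check, cheapest tier first."""
        answer = ""
        for tier, model in TIERS:
            answer = call_llm(model, prompt)
            if looks_ok(answer):
                return answer
        return answer  # even the top tier failed: flag for human review

The check can be as simple as schema validation or a regex over required fields; the point is that escalation is triggered by the task, not by habit.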
From the debrief

A weak model with a great prompt beats a strong model with a vague one. GPT-3.5 Turbo jumped from 5% to 32% just by switching to the expert prompt.

From the debrief

Workhorse is the daily driver. Claude Sonnet 4.6 moved 59% → 77% with the three-lens prompt — solid for routine audit work.

From the debrief

Frontier is qualitatively different. Claude Opus 4.5 hit 95% with the expert prompt — review-grade only because of the prompt.

§ 2 · Beyond the base model

The harness
around the model.

A raw model only knows what it was trained on, only as of when training stopped, and only what fits in the prompt. Three patterns extend it. Each one shows up in the products you already use — knowing which is which tells you what the system can and can't do.

01
Pattern
Retrieval-Augmented Generation · RAG

The model searches a knowledge base — UpToDate, hospital protocols, internal guidelines — before answering. The retrieved text is appended to the prompt so the model answers with current, institution-specific information instead of whatever it absorbed during training.

flow: question → retrieve relevant docs → stuff into prompt → answer with citations
Solves the "training data is stale" and "doesn't know our protocols" problems. Doesn't fix hallucinated reasoning over the retrieved text.
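
A minimal Python sketch of that flow. retrieve() and generate() are hypothetical stand-ins for a vector store and a model API; no specific library is implied.

    def retrieve(question: str, k: int = 3) -> list[str]:
        # Stand-in: real systems embed the question and search an index
        # built over e.g. hospital protocols or formulary documents.
        return ["(relevant passage 1)", "(relevant passage 2)"][:k]

    def generate(prompt: str) -> str:
        return "(answer citing [1], [2])"  # stand-in for the model call

    def rag_answer(question: str) -> str:
        docs = retrieve(question)
        # Stuff the retrieved text into the prompt so the model answers
        # from current, institution-specific sources, not training data.
        numbered = "\n\n".join(f"[{i + 1}] {d}" for i, d in enumerate(docs))
        prompt = (
            "Answer using ONLY the sources below and cite them by number.\n\n"
            f"{numbered}\n\nQuestion: {question}"
        )
        return generate(prompt)

Note that the failure mode named above survives the pattern: the model can still reason wrongly over the retrieved text. Numbered citations exist to make that verification cheap.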
02
Pattern
Tools · Function calling

Let the model call functions during a conversation — search the web, read PDFs, query a database, run Python. The model decides when and how. This turns a static text generator into something that can act on live information instead of guessing.

examples: ChatGPT Search · Claude with computer use · custom medical agents calling EHR APIs
Solves the "model is sealed off from the world" problem. Doesn't fix the model's judgement about which tools to use or when to stop calling them.
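
The shape of the tool-calling loop, sketched in Python. The message and reply formats loosely mimic common chat APIs but are simplified; call_model() and the tool names are assumptions, not a real SDK.

    import json

    TOOLS = {
        # Illustrative tools; a real deployment registers vetted functions.
        "query_ehr":  lambda args: json.dumps({"K+": 5.9, "unit": "mmol/L"}),
        "search_web": lambda args: "(search results)",
    }

    def call_model(messages: list[dict], tools: list[str]) -> dict:
        return {"type": "text", "content": "(final answer)"}  # stand-in

    def run_with_tools(messages: list[dict]) -> str:
        while True:
            reply = call_model(messages, tools=list(TOOLS))
            if reply["type"] != "tool_call":
                return reply["content"]  # the model chose to answer in text
            # The model decided it needs live data: run the tool and feed
            # the result back so it can continue.
            result = TOOLS[reply["name"]](reply.get("arguments", {}))
            messages.append({"role": "tool", "name": reply["name"], "content": result})

Nothing in the loop checks whether the model picked the right tool or should have stopped; that judgement is still the model's, which is exactly the limit named above.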
03
Pattern
Agents · Autonomous multi-step work

An LLM that takes multi-step actions on your behalf. Reads files, searches the web, drafts documents, submits forms. You give it a goal; it composes its own sequence of tool calls to get there.

example: this tutorial site, the slide deck, and the exercise app were all built by Claude — an AI agent that read existing materials, designed the pages, wrote the code, and deployed the result.
Solves the "human has to drive every step" problem. Trades direct control for leverage — the more autonomous the agent, the more you need a human-in-the-loop checkpoint before it touches anything that matters.
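
A Python sketch of the agent loop with that human-in-the-loop checkpoint made explicit. plan_next_step(), execute(), and the action names are all illustrative, not a real framework.

    SAFE_ACTIONS = {"read_file", "search_web", "draft_document"}

    def plan_next_step(history: list[str]) -> tuple[str, dict]:
        return ("done", {})  # stand-in for the model choosing its next action

    def execute(action: str, args: dict) -> str:
        return "(result)"  # stand-in for actually running the tool

    def run_agent(goal: str, max_steps: int = 20) -> None:
        history = [f"GOAL: {goal}"]
        for _ in range(max_steps):
            action, args = plan_next_step(history)
            if action == "done":
                return
            if action not in SAFE_ACTIONS:
                # The control/leverage trade: anything that submits, sends,
                # or modifies pauses for a human before it runs.
                if input(f"Approve {action}({args})? [y/N] ").strip().lower() != "y":
                    history.append(f"DENIED: {action}")
                    continue
            history.append(f"{action} -> {execute(action, args)}")

The whitelist encodes the trade: actions that are easy to review or undo run freely; anything that submits or sends waits for approval.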
§ 3 · The decision tree

What are
you doing?

Start at the top, follow the branch that matches the actual task in front of you, and read the leaf as a recipe — model tier plus harness plus prompt style. The clinical-decision branch is highlighted because it's the only branch where a human-in-the-loop checkpoint is non-negotiable.

Step 01 · Start here · What are you doing?

  • one-off lookup → Leaf 01 · Tier 01. Fast model; a one-shot prompt is fine. (summary · classify · format) Prompt: "Summarize this in 3 bullets."
  • multi-step audit → Leaf 02 · Tier 02. Workhorse + role prompt, three-lens style. (audit · synthesis · extract) Prompt: "You are a pharmacist…"
  • novel reasoning → Leaf 03 · Tier 03–04. Frontier, reasoning mode; accept the latency. (complex case · ambiguity) Prompt: step-by-step, show reasoning.
  • clinical decision → Leaf 04 · Highest stakes. Frontier + RAG; cite institutional source. HUMAN-IN-THE-LOOP REQUIRED. Prompt: three lenses + cite sources.

Routine path · Triage the output. Verify what's verifiable. Move on. You're the last line of defense.
Clinical path · A licensed clinician signs off before action. No exceptions, no shortcuts.

Read top-to-bottom · branch on the actual task, not the model you happen to have open.
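
The tree reads naturally as a lookup: task kind in, recipe out. A Python sketch using the page's own four branches; the field names are illustrative, not an API.

    RECIPES = {
        "one-off lookup": {
            "tier": "01 · Fast", "harness": "none",
            "prompt": "Summarize this in 3 bullets.",
        },
        "multi-step audit": {
            "tier": "02 · Workhorse", "harness": "role prompt, three-lens style",
            "prompt": "You are a pharmacist…",
        },
        "novel reasoning": {
            "tier": "03–04 · Frontier / Reasoning", "harness": "reasoning mode",
            "prompt": "Step-by-step, show reasoning.",
        },
        "clinical decision": {
            "tier": "03 · Frontier", "harness": "RAG, cite institutional source",
            "prompt": "Three lenses + cite sources.",
            "human_in_the_loop": True,  # non-negotiable on this branch
        },
    }

    def recipe_for(task: str) -> dict:
        """Branch on the actual task, not the model you happen to have open."""
        return RECIPES[task]

Only "clinical decision" carries human_in_the_loop, matching the highlighted branch above.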

The model is the brain. The harness — RAG, tools, the interface you use — is the body. A great model in a bad harness underperforms a smaller model in a great one.

Takeaway · Choosing a Model