Choosing a Model / Debrief · part 02 / ≈ 8 minutes
04

Choosing a Model.

Now that you've seen prompting matter, here's how to think about which model — and which harness — to reach for.

In this section
  • The four-tier model spectrum
  • Beyond the base model · RAG, tools, agents
  • A decision tree for the next task

A model is one ingredient. The right model in the right harness, with the right prompt, is the actual unit of work.

§ 1 · The model spectrum

Four tiers,
one continuum.

Models trade off cost, latency, and depth of reasoning. Pick the lightest tier that still does the job — then move up only when the task needs it. The names below shift every six months; the shape of the spectrum doesn't.

Cost is per-token spend, relative across tiers for the same task. Latency is rough wall-clock time to a useful answer, not time to first token.

Tier 01 · Fast
Models · Claude Haiku · GPT-4o mini · Gemini Flash
Best for · High-volume, low-stakes shaping. Summaries, formatting, classification, autocomplete. Anything where you'd rather have ten passable answers than one perfect one.
Limits · Shallow reasoning. Loses the thread on long context. Will not recover from a vague prompt the way a frontier model can.
Cost · low · Latency · ms
Tier 02 · Workhorse
Models · Claude Sonnet 4.6 · GPT-4o · Gemini Pro
Best for · Most clinical-adjacent work that doesn't require novel reasoning. Multi-step audits, drafting, structured extraction, role-based prompts like the three-lens chart review.
Limits · Plateaus on genuinely hard or ambiguous cases. Confidently produces plausible-but-wrong answers more often than a reasoning model does.
Cost · moderate · Latency · seconds
Tier 03 · Frontier
Models · Claude Opus 4.5 · GPT-5 · Gemini Ultra
Best for · Cases where being right matters more than being fast. Subtle differential reasoning, long synthesis, novel writing, anything where you'd ask a senior colleague before acting.
Limits · Slower, more expensive per call. Still hallucinates citations. Confidence still outpaces calibration on the long tail.
Cost · high · Latency · seconds
Tier 04 · Reasoning
Models · Claude Opus 4.5 (extended) · OpenAI o3 / o4 · Gemini Deep Think
Best for · Hard, multi-step problems where you'd accept 30+ seconds for a better answer. Complex case reasoning, untangling conflicting evidence, math-shaped diagnostics.
Limits · Latency makes it the wrong tool for chat. Reasoning chains are not transcripts of clinical thought — don't read them as such.
Cost · highest · Latency · 30s+
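
To make "pick the lightest tier that still does the job" concrete, here is a minimal Python sketch of escalation: try the cheap model first and move up only when a check fails. call_llm() and the model-class names are illustrative assumptions, not any real SDK.

    from typing import Callable

    # Escalation sketch: start at the cheapest tier, move up only when a
    # validation check fails. Model-class names and call_llm() are
    # illustrative assumptions, not a real API.
    TIERS = [
        ("fast",      "haiku-class"),     # ms latency, high volume
        ("workhorse", "sonnet-class"),    # routine multi-step work
        ("frontier",  "opus-class"),      # correctness over speed
        ("reasoning", "opus-extended"),   # 30s+ deliberate chains
    ]

    def call_llm(model: str, prompt: str) -> str:
        return "(model answer)"  # stand-in for a real completion call

    def answer_with_escalation(prompt: str, looks_ok: Callable[[str], bool]) -> str:
        """Return the first answer that passes the check, cheapest tier first."""
        answer = ""
        for tier, model in TIERS:
            answer = call_llm(model, prompt)
            if looks_ok(answer):
                return answer
        return answer  # even the top tier failed: flag for human review

The check can be as simple as schema validation or a regex over required fields; the point is that escalation is triggered by the task, not by habit.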
From the debrief

A weak model with a great prompt beats a strong model with a vague one. GPT-3.5 Turbo jumped from 5% to 32% just by switching to the expert prompt.

From the debrief

Workhorse is the daily driver. Claude Sonnet 4.6 moved 59% → 77% with the three-lens prompt — solid for routine audit work.

From the debrief

Frontier is qualitatively different. Claude Opus 4.5 hit 95% with the expert prompt — review-grade only because of the prompt.

§ 2 · Beyond the base model

The harness
around the model.

A raw model only knows what it was trained on, only as of when training stopped, and only what fits in the prompt. Three patterns extend it. Each one shows up in the products you already use — knowing which is which tells you what the system can and can't do.

01
Pattern
Retrieval-Augmented Generation · RAG

The model searches a knowledge base — UpToDate, hospital protocols, internal guidelines — before answering. The retrieved text is appended to the prompt so the model answers with current, institution-specific information instead of whatever it absorbed during training.

flow: question → retrieve relevant docs → stuff into prompt → answer with citations
Solves the "training data is stale" and "doesn't know our protocols" problems. Doesn't fix hallucinated reasoning over the retrieved text.
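
A minimal Python sketch of that flow. retrieve() and generate() are hypothetical stand-ins for a vector store and a model API; no specific library is implied.

    def retrieve(question: str, k: int = 3) -> list[str]:
        # Stand-in: real systems embed the question and search an index
        # built over e.g. hospital protocols or formulary documents.
        return ["(relevant passage 1)", "(relevant passage 2)"][:k]

    def generate(prompt: str) -> str:
        return "(answer citing [1], [2])"  # stand-in for the model call

    def rag_answer(question: str) -> str:
        docs = retrieve(question)
        # Stuff the retrieved text into the prompt so the model answers
        # from current, institution-specific sources, not training data.
        numbered = "\n\n".join(f"[{i + 1}] {d}" for i, d in enumerate(docs))
        prompt = (
            "Answer using ONLY the sources below and cite them by number.\n\n"
            f"{numbered}\n\nQuestion: {question}"
        )
        return generate(prompt)

Note that the failure mode named above survives the pattern: the model can still reason wrongly over the retrieved text. Numbered citations exist to make that verification cheap.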
02
Pattern
Tools · Function calling

Let the model call functions during a conversation — search the web, read PDFs, query a database, run Python. The model decides when and how. This turns a static text generator into something that can act on live information instead of guessing.

examples: ChatGPT Search · Claude with computer use · custom medical agents calling EHR APIs
Solves the "model is sealed off from the world" problem. Doesn't fix the model's judgement about which tools to use or when to stop calling them.
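
The shape of the tool-calling loop, sketched in Python. The message and reply formats loosely mimic common chat APIs but are simplified; call_model() and the tool names are assumptions, not a real SDK.

    import json

    TOOLS = {
        # Illustrative tools; a real deployment registers vetted functions.
        "query_ehr":  lambda args: json.dumps({"K+": 5.9, "unit": "mmol/L"}),
        "search_web": lambda args: "(search results)",
    }

    def call_model(messages: list[dict], tools: list[str]) -> dict:
        return {"type": "text", "content": "(final answer)"}  # stand-in

    def run_with_tools(messages: list[dict]) -> str:
        while True:
            reply = call_model(messages, tools=list(TOOLS))
            if reply["type"] != "tool_call":
                return reply["content"]  # the model chose to answer in text
            # The model decided it needs live data: run the tool and feed
            # the result back so it can continue.
            result = TOOLS[reply["name"]](reply.get("arguments", {}))
            messages.append({"role": "tool", "name": reply["name"], "content": result})

Nothing in the loop checks whether the model picked the right tool or should have stopped; that judgement is still the model's, which is exactly the limit named above.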
03
Pattern
Agents · Autonomous multi-step work

An LLM that takes multi-step actions on your behalf. Reads files, searches the web, drafts documents, submits forms. You give it a goal; it composes its own sequence of tool calls to get there.

example: this tutorial site, the slide deck, and the exercise app were all built by Claude — an AI agent that read existing materials, designed the pages, wrote the code, and deployed the result.
Solves the "human has to drive every step" problem. Trades direct control for leverage — the more autonomous the agent, the more you need a human-in-the-loop checkpoint before it touches anything that matters.
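
A Python sketch of the agent loop with that human-in-the-loop checkpoint made explicit. plan_next_step(), execute(), and the action names are all illustrative, not a real framework.

    SAFE_ACTIONS = {"read_file", "search_web", "draft_document"}

    def plan_next_step(history: list[str]) -> tuple[str, dict]:
        return ("done", {})  # stand-in for the model choosing its next action

    def execute(action: str, args: dict) -> str:
        return "(result)"  # stand-in for actually running the tool

    def run_agent(goal: str, max_steps: int = 20) -> None:
        history = [f"GOAL: {goal}"]
        for _ in range(max_steps):
            action, args = plan_next_step(history)
            if action == "done":
                return
            if action not in SAFE_ACTIONS:
                # The control/leverage trade: anything that submits, sends,
                # or modifies pauses for a human before it runs.
                if input(f"Approve {action}({args})? [y/N] ").strip().lower() != "y":
                    history.append(f"DENIED: {action}")
                    continue
            history.append(f"{action} -> {execute(action, args)}")

The whitelist encodes the trade: actions that are easy to review or undo run freely; anything that submits or sends waits for approval.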
§ 3 · The decision tree

What are
you doing?

Start at the top, follow the branch that matches the actual task in front of you, and read the leaf as a recipe — model tier plus harness plus prompt style. The clinical-decision branch is highlighted because it's the only branch where a human-in-the-loop checkpoint is non-negotiable.

Step 01 · Start here · What are you doing?

  • one-off lookup → Leaf 01 · Tier 01. Fast model; a one-shot prompt is fine. (summary · classify · format) Prompt: "Summarize this in 3 bullets."
  • multi-step audit → Leaf 02 · Tier 02. Workhorse + role prompt, three-lens style. (audit · synthesis · extract) Prompt: "You are a pharmacist…"
  • novel reasoning → Leaf 03 · Tier 03–04. Frontier, reasoning mode; accept the latency. (complex case · ambiguity) Prompt: step-by-step, show reasoning.
  • clinical decision → Leaf 04 · Highest stakes. Frontier + RAG; cite institutional source. HUMAN-IN-THE-LOOP REQUIRED. Prompt: three lenses + cite sources.

Routine path · Triage the output. Verify what's verifiable. Move on. You're the last line of defense.
Clinical path · A licensed clinician signs off before action. No exceptions, no shortcuts.

Read top-to-bottom · branch on the actual task, not the model you happen to have open.
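
The tree reads naturally as a lookup: task kind in, recipe out. A Python sketch using the page's own four branches; the field names are illustrative, not an API.

    RECIPES = {
        "one-off lookup": {
            "tier": "01 · Fast", "harness": "none",
            "prompt": "Summarize this in 3 bullets.",
        },
        "multi-step audit": {
            "tier": "02 · Workhorse", "harness": "role prompt, three-lens style",
            "prompt": "You are a pharmacist…",
        },
        "novel reasoning": {
            "tier": "03–04 · Frontier / Reasoning", "harness": "reasoning mode",
            "prompt": "Step-by-step, show reasoning.",
        },
        "clinical decision": {
            "tier": "03 · Frontier", "harness": "RAG, cite institutional source",
            "prompt": "Three lenses + cite sources.",
            "human_in_the_loop": True,  # non-negotiable on this branch
        },
    }

    def recipe_for(task: str) -> dict:
        """Branch on the actual task, not the model you happen to have open."""
        return RECIPES[task]

Only "clinical decision" carries human_in_the_loop, matching the highlighted branch above.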

The model is the brain. The harness — RAG, tools, the interface you use — is the body. A great model in a bad harness underperforms a smaller model in a great one.

Takeaway · Choosing a Model