What just happened, and what to take away.
You just spent 15 minutes auditing an H&P with an AI. Here's what was happening behind the scenes.
We ran the same 22-error H&P through three models — a small open-source model, a mid-tier model, and a frontier model — using two prompt styles each: a vague "find errors" and the three-lens expert prompt from Chapter 2.
Read the table one model at a time, comparing its two prompt rows before comparing models to each other. The pattern that matters isn't which model wins; it's how much prompting widens the gap.
| Model | Prompt | Catch rate | Errors caught (of 22) |
|---|---|---|---|
| GPT-3.5 Turbo (small, older) | Vague "find errors" | 5% | 1 / 22 |
| GPT-3.5 Turbo (small, older) | Expert, three lenses | 32% | 7 / 22 |
| Claude Sonnet 4.6 (workhorse) | Vague "find errors" | 59% | 13 / 22 |
| Claude Sonnet 4.6 (workhorse) | Expert, three lenses | 77% | 17 / 22 |
| Claude Opus 4.5 (frontier) | Vague "find errors" | 45% | 10 / 22 |
| Claude Opus 4.5 (frontier) | Expert, three lenses | 95% | 21 / 22 |
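If you want to rerun this kind of comparison on your own documents, the harness is just a grid: one document, each model crossed with each prompt style, and a catch rate scored against the seeded answer key. The sketch below shows that shape under stated assumptions: `call_model` is a stub for whatever API client you use, the model names and answer-key entries are placeholders, and the compressed "expert" prompt only gestures at the Chapter 2 version.

```python
from itertools import product

# Illustrative answer-key entries; the real key has 22 seeded errors.
SEEDED_ERRORS: list[str] = [
    "metoprolol 250 mg instead of 25 mg",
    "penicillin allergy missing from medication reconciliation",
    # ... 20 more
]

PROMPTS = {
    "vague": "Find errors in this H&P.",
    # Compressed stand-in for the three-lens expert prompt; the lens names are assumptions.
    "expert": (
        "Review this H&P three separate times: clinical reasoning, medication "
        "safety, and internal consistency. Cite the chart line for every "
        "suspected error."
    ),
}

MODELS = ["small-older-model", "workhorse-model", "frontier-model"]  # substitute real model IDs


def call_model(model: str, prompt: str, document: str) -> str:
    """Stub for whatever API client you use; returns the model's findings as text."""
    raise NotImplementedError


def catch_rate(findings: str, answer_key: list[str]) -> float:
    """Naive scoring: a seeded error counts as caught if its key phrase appears verbatim.
    In practice, score each finding by hand; paraphrases won't string-match."""
    caught = sum(1 for err in answer_key if err.lower() in findings.lower())
    return caught / len(answer_key)


def run_grid(document: str) -> None:
    # One row per model-prompt pair, mirroring the table above.
    for model, (style, prompt) in product(MODELS, PROMPTS.items()):
        rate = catch_rate(call_model(model, prompt, document), SEEDED_ERRORS)
        print(f"{model:<20} {style:<8} {rate:.0%}")
```

String matching will undercount paraphrased findings, so treat anything a harness like this prints as a floor and confirm by reading the transcripts.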
A workhorse with a real prompt outperforms a flagship with a lazy one. Sonnet + expert prompt (77%) beats Opus + "find errors" (45%) by 32 points on the same chart.
Frontier + expert is qualitatively different. A 95% catch rate on a 22-error chart is review-grade; the only missed item was a guideline-specific glycemic target that no run in the table caught.
"Find errors" is never the right prompt. Even on Opus, vague prompting more than halves the yield (45% vs 95%) and hedges its language enough to blunt what signal remains.
Three patterns showed up across every run that fell short. Each one is a kind of confidence the model has no business expressing — and recognizing them on sight is the difference between a useful tool and a dangerous one.
The first pattern: the model invents a guideline section, study, or dose that sounds plausible. It will cite a real journal and a real year and a fake page. The grammar of medicine is easy to reproduce; the facts are not.
The section number is fabricated. The target is roughly right; the citation is decoration.
The second pattern: push back and the model folds. "Are you sure?" rewrites the answer. "I don't think that's right" rewrites it again. The model is optimizing for your approval, not for being correct.
Both runs were equally confident. Neither answer was verified against a source.
The third pattern: the model returns the textbook stroke workup instead of reading what's actually documented. It tells you what usually happens, not what's in this chart. A vague prompt makes this nearly automatic.
All three of the suggested items were already documented. The model ignored the chart and recited the protocol.
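A prompt that demands quoted chart text and explicit corrections gives the model less room to decorate; the vague prompt gives it every opening. As a reference point, here is one way a three-lens prompt with those guardrails could be laid out. It is a reconstruction under assumptions, not the Chapter 2 text; the lens names, output rules, and guardrails are all illustrative.

```python
VAGUE_PROMPT = "Find errors in this H&P."

# Assumed structure only, written in the spirit of a three-lens review prompt;
# not the actual Chapter 2 text.
EXPERT_PROMPT = """You are reviewing a hospital H&P for errors. Make three separate passes:

1. Attending physician: does the assessment and plan follow from the documented findings?
2. Pharmacist: check every drug name, dose, route, frequency, allergy, and interaction.
3. Documentation auditor: do the HPI, exam, labs, and plan agree with one another?

For every suspected error, report:
- the exact chart text, quoted
- why it is wrong
- the correction you would make

If you cannot point to specific chart text, say so rather than guessing.
Do not recommend workup that the chart already documents."""
```

The last two lines exist for the failure modes above: requiring quoted chart text makes invented citations easier to spot, and the final constraint pushes back on recited protocol.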
When the AI hands you a list, don't accept or reject the whole thing. Sort each line into one of three categories before it leaves your desk. The bucket determines what you do next — not the model's confidence, not the formatting, not the citation.
The first bucket: the claim points at a specific fact in the chart that you can confirm by re-reading one line. If it checks out, accept it and move on. Most pharmacist-lens findings live here.
The second bucket: the claim is specific but rests on a guideline, dose range, or interaction the model could have hallucinated. Pull up UpToDate, the actual AHA/ASA document, or Lexicomp before acting on it.
The third bucket: hedge words ("may want to consider," "appears reasonable"), missing chart references, or recommendations that ignore documented findings. Discard the line; don't try to repair it.
The point of the buckets isn't certainty — it's deciding, before you act, how much trust this specific line has earned. Every line earns its bucket independently.
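A first pass at that sorting can even be mechanical. The sketch below is an assumption about how you might pre-sort a findings list before reading it; the hedge phrases, the chart-reference pattern, and the external-knowledge markers are all illustrative heuristics, not a validated screen.

```python
import re
from enum import Enum


class Bucket(Enum):
    CHART_VERIFIABLE = "confirm by re-reading the chart, then accept"
    SOURCE_DEPENDENT = "verify against UpToDate, the guideline, or Lexicomp"
    DISCARD = "hedged or unanchored; discard, don't repair"


# Illustrative hedge phrases; extend from your own transcripts.
HEDGES = ("may want to consider", "appears reasonable", "might be worth", "could consider")

# Illustrative heuristic: a finding that names a chart location is checkable in place.
CHART_REF = re.compile(r"line \d+|per the (hpi|med list|assessment)|documented in", re.IGNORECASE)

# Illustrative heuristic: claims resting on outside knowledge name a guideline, dose, or interaction.
EXTERNAL = re.compile(r"\b(guideline|aha|asa|dose range|interaction|contraindicated)\b", re.IGNORECASE)


def triage(finding: str) -> Bucket:
    text = finding.lower()
    if any(hedge in text for hedge in HEDGES):
        return Bucket.DISCARD            # hedge words disqualify the line outright
    if CHART_REF.search(finding):
        return Bucket.CHART_VERIFIABLE   # points at something you can re-read in one line
    if EXTERNAL.search(finding):
        return Bucket.SOURCE_DEPENDENT   # rests on knowledge the model may have invented
    return Bucket.DISCARD                # no anchor at all: treat as noise


if __name__ == "__main__":
    findings = [
        "Line 14: metoprolol listed as 250 mg daily; the admission order reads 25 mg.",
        "Apixaban 10 mg BID exceeds the recommended dose range for this creatinine clearance.",
        "You may want to consider a broader stroke workup.",
    ]
    for f in findings:
        print(f"{triage(f).name:<18} {f}")
```

Treat the output of a pass like this as a reading order, not a verdict; every line still earns its bucket from you.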
The model is a fast, fluent, occasionally wrong intern. Your job didn't change — you just got a faster intern.