What just happened, and what to take away.
You just spent 15 minutes auditing an H&P with an AI. Here's what was happening behind the scenes.
We ran the same 22-error H&P through three models — a small open-source model, a mid-tier model, and a frontier model — using two prompt styles each: a vague "find errors" and the three-lens expert prompt from Chapter 2.
Read the table one model at a time, comparing its two prompt rows before comparing models to each other. The pattern that matters isn't which model wins; it's how much prompting widens the gap.
| Model | Prompt | Catch rate | Errors caught (of 22) |
|---|---|---|---|
| GPT-3.5 Turbo (small, older) | Vague "find errors" | 5% | 1 / 22 |
| GPT-3.5 Turbo (small, older) | Expert, three lenses | 32% | 7 / 22 |
| Claude Sonnet 4.6 (workhorse) | Vague "find errors" | 59% | 13 / 22 |
| Claude Sonnet 4.6 (workhorse) | Expert, three lenses | 77% | 17 / 22 |
| Claude Opus 4.5 (frontier) | Vague "find errors" | 45% | 10 / 22 |
| Claude Opus 4.5 (frontier) | Expert, three lenses | 95% | 21 / 22 |
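If you want to rerun this kind of comparison on your own documents, the harness is just a grid: one document, each model crossed with each prompt style, and a catch rate scored against the seeded answer key. The sketch below shows that shape under stated assumptions: `call_model` is a stub for whatever API client you use, the model names and answer-key entries are placeholders, and the compressed "expert" prompt only gestures at the Chapter 2 version.

```python
from itertools import product

# Illustrative answer-key entries; the real key has 22 seeded errors.
SEEDED_ERRORS: list[str] = [
    "metoprolol 250 mg instead of 25 mg",
    "penicillin allergy missing from medication reconciliation",
    # ... 20 more
]

PROMPTS = {
    "vague": "Find errors in this H&P.",
    # Compressed stand-in for the three-lens expert prompt; the lens names are assumptions.
    "expert": (
        "Review this H&P three separate times: clinical reasoning, medication "
        "safety, and internal consistency. Cite the chart line for every "
        "suspected error."
    ),
}

MODELS = ["small-older-model", "workhorse-model", "frontier-model"]  # substitute real model IDs


def call_model(model: str, prompt: str, document: str) -> str:
    """Stub for whatever API client you use; returns the model's findings as text."""
    raise NotImplementedError


def catch_rate(findings: str, answer_key: list[str]) -> float:
    """Naive scoring: a seeded error counts as caught if its key phrase appears verbatim.
    In practice, score each finding by hand; paraphrases won't string-match."""
    caught = sum(1 for err in answer_key if err.lower() in findings.lower())
    return caught / len(answer_key)


def run_grid(document: str) -> None:
    # One row per model-prompt pair, mirroring the table above.
    for model, (style, prompt) in product(MODELS, PROMPTS.items()):
        rate = catch_rate(call_model(model, prompt, document), SEEDED_ERRORS)
        print(f"{model:<20} {style:<8} {rate:.0%}")
```

String matching will undercount paraphrased findings, so treat anything a harness like this prints as a floor and confirm by reading the transcripts.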
A workhorse with a real prompt outperforms a flagship with a lazy one. Sonnet + expert prompt (77%) beats Opus + "find errors" (45%) by 32 points on the same chart.
Frontier + expert is qualitatively different. A 95% catch rate on a 22-error chart is review-grade; the only missed item was a guideline-specific glycemic target that no run in the table caught.
"Find errors" is never the right prompt. Even on Opus, vague prompting more than halves the yield (45% vs 95%) and hedges its language enough to blunt what signal remains.
Three patterns showed up across every run that fell short. Each one is a kind of confidence the model has no business expressing — and recognizing them on sight is the difference between a useful tool and a dangerous one.
The first pattern: the model invents a guideline section, study, or dose that sounds plausible. It will cite a real journal and a real year and a fake page. The grammar of medicine is easy to reproduce; the facts are not.
The section number is fabricated. The target is roughly right; the citation is decoration.
The second pattern: push back and the model folds. "Are you sure?" rewrites the answer. "I don't think that's right" rewrites it again. The model is optimizing for your approval, not for being correct.
Both runs were equally confident. Neither answer was verified against a source.
The third pattern: the model returns the textbook stroke workup instead of reading what's actually documented. It tells you what usually happens, not what's in this chart. A vague prompt makes this nearly automatic.
All three of the suggested items were already documented. The model ignored the chart and recited the protocol.
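A prompt that demands quoted chart text and explicit corrections gives the model less room to decorate; the vague prompt gives it every opening. As a reference point, here is one way a three-lens prompt with those guardrails could be laid out. It is a reconstruction under assumptions, not the Chapter 2 text; the lens names, output rules, and guardrails are all illustrative.

```python
VAGUE_PROMPT = "Find errors in this H&P."

# Assumed structure only, written in the spirit of a three-lens review prompt;
# not the actual Chapter 2 text.
EXPERT_PROMPT = """You are reviewing a hospital H&P for errors. Make three separate passes:

1. Attending physician: does the assessment and plan follow from the documented findings?
2. Pharmacist: check every drug name, dose, route, frequency, allergy, and interaction.
3. Documentation auditor: do the HPI, exam, labs, and plan agree with one another?

For every suspected error, report:
- the exact chart text, quoted
- why it is wrong
- the correction you would make

If you cannot point to specific chart text, say so rather than guessing.
Do not recommend workup that the chart already documents."""
```

The last two lines exist for the failure modes above: requiring quoted chart text makes invented citations easier to spot, and the final constraint pushes back on recited protocol.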
When the AI hands you a list, don't accept or reject the whole thing. Sort each line into one of three categories before it leaves your desk. The bucket determines what you do next — not the model's confidence, not the formatting, not the citation.
The first bucket: the claim points at a specific fact in the chart that you can confirm by re-reading one line. If it checks out, accept it and move on. Most pharmacist-lens findings live here.
The second bucket: the claim is specific but rests on a guideline, dose range, or interaction the model could have hallucinated. Pull up UpToDate, the actual AHA/ASA document, or Lexicomp before acting on it.
The third bucket: hedge words ("may want to consider," "appears reasonable"), missing chart references, or recommendations that ignore documented findings. Discard the line; don't try to repair it.
The point of the buckets isn't certainty — it's deciding, before you act, how much trust this specific line has earned. Every line earns its bucket independently.
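A first pass at that sorting can even be mechanical. The sketch below is an assumption about how you might pre-sort a findings list before reading it; the hedge phrases, the chart-reference pattern, and the external-knowledge markers are all illustrative heuristics, not a validated screen.

```python
import re
from enum import Enum


class Bucket(Enum):
    CHART_VERIFIABLE = "confirm by re-reading the chart, then accept"
    SOURCE_DEPENDENT = "verify against UpToDate, the guideline, or Lexicomp"
    DISCARD = "hedged or unanchored; discard, don't repair"


# Illustrative hedge phrases; extend from your own transcripts.
HEDGES = ("may want to consider", "appears reasonable", "might be worth", "could consider")

# Illustrative heuristic: a finding that names a chart location is checkable in place.
CHART_REF = re.compile(r"line \d+|per the (hpi|med list|assessment)|documented in", re.IGNORECASE)

# Illustrative heuristic: claims resting on outside knowledge name a guideline, dose, or interaction.
EXTERNAL = re.compile(r"\b(guideline|aha|asa|dose range|interaction|contraindicated)\b", re.IGNORECASE)


def triage(finding: str) -> Bucket:
    text = finding.lower()
    if any(hedge in text for hedge in HEDGES):
        return Bucket.DISCARD            # hedge words disqualify the line outright
    if CHART_REF.search(finding):
        return Bucket.CHART_VERIFIABLE   # points at something you can re-read in one line
    if EXTERNAL.search(finding):
        return Bucket.SOURCE_DEPENDENT   # rests on knowledge the model may have invented
    return Bucket.DISCARD                # no anchor at all: treat as noise


if __name__ == "__main__":
    findings = [
        "Line 14: metoprolol listed as 250 mg daily; the admission order reads 25 mg.",
        "Apixaban 10 mg BID exceeds the recommended dose range for this creatinine clearance.",
        "You may want to consider a broader stroke workup.",
    ]
    for f in findings:
        print(f"{triage(f).name:<18} {f}")
```

Treat the output of a pass like this as a reading order, not a verdict; every line still earns its bucket from you.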
The model is a fast, fluent, occasionally wrong intern. Your job didn't change — you just got a faster intern.