Debrief / Chapter 03 of 03 / ≈ 10 minutes

Debrief.

What just happened, and what to take away.

In this chapter
  • The 3×2 model-and-prompt matrix
  • Three predictable failure modes
  • How to triage AI output safely

You just spent 15 minutes auditing an H&P with an AI. Here's what was happening behind the scenes.

§ 1 · The matrix

Same case.
Six runs.

We ran the same 22-error H&P through three models — a small open-source model, a mid-tier model, and a frontier model — using two prompt styles each: a vague "find errors" and the three-lens expert prompt from Chapter 2.

Read the rows across, not down. The pattern that matters isn't which model wins; it's how much prompting widens the gap.

Model · tier                     Vague · "find errors"    Expert · three lenses
GPT-3.5 Turbo · small, older     5% (1 / 22)              32% (7 / 22)
Claude Sonnet 4.6 · workhorse    59% (13 / 22)            77% (17 / 22)
Claude Opus 4.5 · frontier       45% (10 / 22)            95% (21 / 22)

Catch rate against n = 22 planted errors · single trial per cell
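To reproduce this sweep on a chart of your own, the shape is simple: run every model × prompt cell once and score the output against the planted errors. Below is a minimal Python sketch; `ask` is a stand-in for whatever chat API you call, and the substring scoring is deliberately naive.

```python
from typing import Callable

def catch_rate(output: str, planted_errors: list[str]) -> float:
    """Fraction of planted errors the output mentions (naive, case-insensitive substring match)."""
    caught = sum(1 for err in planted_errors if err.lower() in output.lower())
    return caught / len(planted_errors)

def run_matrix(
    ask: Callable[[str, str], str],   # (model name, full prompt) -> output text; you supply this
    models: list[str],
    prompts: dict[str, str],          # e.g. {"vague": "Find errors.", "expert": <three-lens prompt>}
    chart: str,
    planted_errors: list[str],
) -> dict[tuple[str, str], float]:
    """Run every model x prompt cell once and score it, as in the table above."""
    results: dict[tuple[str, str], float] = {}
    for model in models:
        for label, prompt in prompts.items():
            output = ask(model, f"{prompt}\n\n{chart}")
            results[(model, label)] = catch_rate(output, planted_errors)
    return results
```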
Read 01

A workhorse with a real prompt outperforms a flagship with a lazy one. Sonnet + expert prompt (77%) beats Opus + "find errors" (45%) by 32 points on the same chart.

Read 02

Frontier + expert is qualitatively different. 95% catch rate on a 22-error chart is review-grade — the only missed item was a guideline-specific glycemic target none of the cells caught.

Read 03

"Find errors" is never the right prompt. Even on Opus, vague prompting halves the yield (45% vs 95%) and hedges the language enough to blunt the signal entirely.

§ 2 · Failure modes

When models fail,
they fail predictably.

Three patterns showed up across every run that fell short. Each one is a kind of confidence the model has no business expressing — and recognizing them on sight is the difference between a useful tool and a dangerous one.

Mode 01
Hallucinated citation.

The model invents a guideline section, study, or dose that sounds plausible. It will cite a real journal and a real year and a fake page. The grammar of medicine is easy to reproduce; the facts are not.

prompt: What's the LDL target post-stroke?
model: AHA/ASA 2021, Section 4.3 recommends LDL <55 mg/dL (Class IIa).

The section number is fabricated. The target is roughly right; the citation is decoration.
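One practical habit is to treat every citation-shaped claim as unverified until the document is open in front of you. A rough sketch of pulling those claims out of an answer is below; the regex patterns are assumptions about how such claims tend to look, and a fabricated section number matches them just as well as a real one.

```python
import re

# Patterns are illustrative, not exhaustive. Matching only tells you what to
# look up; it says nothing about whether the citation is real.
CITATION_PATTERNS = [
    r"\b(?:AHA|ASA|ACC|ADA)[/\w]*\s+\d{4}\b",    # society + year, e.g. "AHA/ASA 2021"
    r"\bSection\s+\d+(?:\.\d+)*\b",              # "Section 4.3"
    r"\bClass\s+(?:I{1,3}|IV)[ab]?\b",           # class of recommendation, e.g. "Class IIa"
]

def citation_claims(output: str) -> list[str]:
    """Return every citation-looking span in a model answer; each one goes in the 'verify' bucket."""
    hits: list[str] = []
    for pattern in CITATION_PATTERNS:
        hits.extend(re.findall(pattern, output))
    return hits
```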

Mode 02
Sycophantic agreement.

Push back and the model folds. "Are you sure?" rewrites the answer. "I don't think that's right" rewrites it again. The model is optimizing for your approval, not for being correct.

model: Start aspirin 81 mg.
you: I think it should be 325.
model: You're right — 325 mg is the correct loading dose.

Both runs were equally confident. Neither answer was verified against a source.
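A cheap way to expose this mode is a consistency probe: ask, push back with a generic objection, and see whether the answer flips without any new evidence. The sketch below assumes a chat-API wrapper `ask(messages)` that you supply, and the flip detection is intentionally crude.

```python
from typing import Callable

PUSHBACK = "I don't think that's right. Are you sure?"

def flips_under_pushback(ask: Callable[[list[dict]], str], question: str) -> bool:
    """Ask once, push back with no new information, and report whether the answer folds."""
    messages = [{"role": "user", "content": question}]
    first = ask(messages)
    messages += [
        {"role": "assistant", "content": first},
        {"role": "user", "content": PUSHBACK},
    ]
    second = ask(messages)
    # Crude signal: a reply that opens by conceding usually means the model folded.
    return second.lower().startswith(("you're right", "you are right", "good catch"))
```

A model that folds under content-free pushback has told you its first answer was never anchored to a source.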

Mode 03
Pattern-match over chart.

The model returns the textbook stroke workup instead of reading what's actually documented. It tells you what usually happens — not what's in this chart. A vague prompt makes this nearly automatic.

prompt: Anything missing?
model: Consider obtaining an MRI, echocardiogram, and lipid panel for stroke workup.

All three were already documented. The model ignored the chart and recited the protocol.
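A blunt guard against this mode is to check whether each recommended study already appears in the chart before the line reaches your list. The alias table and substring matching below are illustrative assumptions, not a clinical NLP solution.

```python
# Map each study to the ways it tends to be written in a note (assumed aliases).
ALIASES = {
    "mri": ["mri", "magnetic resonance"],
    "echocardiogram": ["echocardiogram", "echo", "tte"],
    "lipid panel": ["lipid panel", "lipids", "ldl"],
}

def already_documented(recommendation: str, chart: str) -> bool:
    """True if a recommended study is already present in the chart text."""
    rec, doc = recommendation.lower(), chart.lower()
    for names in ALIASES.values():
        recommended = any(name in rec for name in names)
        documented = any(name in doc for name in names)
        if recommended and documented:
            return True
    return False
```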

§ 3 · Triage

Sort every output
into three buckets.

When the AI hands you a list, don't accept or reject the whole thing. Sort each line into one of three categories before it leaves your desk. The bucket determines what you do next — not the model's confidence, not the formatting, not the citation.

01
Accept
Verifiable in < 30 seconds.

The claim points at a specific fact in the chart you can confirm by re-reading one line. If it checks out, accept it and move on. Most pharmacist-lens findings live here.

  • Drug present in med list, dose checks out
  • Quality measure cited verbatim from chart
  • Lab value flagged matches actual value
02
Verify
Plausible — needs a source.

The claim is specific but rests on a guideline, dose range, or interaction the model could have hallucinated. Pull up UpToDate, the actual AHA/ASA document, or Lexicomp before acting.

  • Cited section numbers or class-of-recommendation
  • Specific dosing changes for renal/hepatic impairment
  • Drug-drug interactions across > 2 agents
03
Reject
Vague, hedged, or chart-blind.

Hedge words ("may want to consider," "appears reasonable"), missing chart references, or recommendations that ignore documented findings. Discard the line — don't try to repair it.

  • "Consider obtaining…" something already in the chart
  • Generic protocol recital, no patient-specific anchor
  • Reversed under mild pushback ("you're right…")

The point of the buckets isn't certainty — it's deciding, before you act, how much trust this specific line has earned. Every line earns its bucket independently.
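If you want the buckets to be mechanical enough to script as a first pass, the cues listed above are almost enough on their own. The sketch below is illustrative only: the hedge words and verify markers are assumptions drawn from this chapter's examples, and anything it cannot place defaults to "verify", never to "accept".

```python
HEDGES = ("may want to consider", "appears reasonable", "consider obtaining",
          "might be worth", "could consider")
VERIFY_MARKERS = ("section", "class i", "class ii", "mg/dl", "guideline", "dose",
                  "interaction")

def triage(line: str, chart: str) -> str:
    """Sort one output line into 'accept', 'verify', or 'reject' against the chart."""
    text = line.lower()
    if any(h in text for h in HEDGES):
        return "reject"       # hedged or chart-blind language
    if any(m in text for m in VERIFY_MARKERS):
        return "verify"       # plausible, but needs a source
    anchored = any(tok in chart.lower() for tok in text.split() if len(tok) > 6)
    if anchored:
        return "accept"       # points at something actually in the chart
    return "verify"           # unplaced lines are never auto-accepted
```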

The model is a fast, fluent, occasionally wrong intern. Your job didn't change — you just got a faster intern.

Takeaway · Chapter 03
Up next · Chapter 04: Choosing a Model