Deep Analysis Bench — Model Comparison
Why this benchmark exists
When a writer imports a book into TalePal, we run an analysis pass that extracts characters, plot events, and chapter summaries. Every later chat turn loads those outputs as context. If the analyzer invents a character that isn't in the manuscript, or attributes a quote that nobody actually said, every downstream feature inherits the lie — and the writer is the last to notice.
This benchmark measures analytical correctness: each model runs the analyzer (steps 1–5) against a real book, and the output is scored against a committed golden reference — a text-anchored truth file in the repo. The four rubrics below combine into an overall score, but the must-have gate is the real pass/fail: a model can be high-scoring overall and still flagged if one of the load-bearing facts is wrong.
For chat-side measurements (time-to-first-token, $/turn, hallucination tells in conversational replies), see the Fast Pingpong Bench.
The contestants
Three models analyzed the same book with the same pipeline. Differences in pricing model and privacy posture come along for the ride.
- gemini-2.5-flash (Google Gemini). Cloud API · paid. Pay-per-token from the first request. Requests leave your machine.
- google/gemma-4-e4b (LM Studio, local). Local LLM. Always free, fully private: runs on your own machine, no data leaves the device. Speed depends on local hardware. This run: MacBook Air (M3).
- gpt-4.1-mini (OpenAI). Cloud API · paid. Pay-per-token from the first request. Requests leave your machine.
Test corpus
The deep bench runs the analyzer against the dracula-4ch-bench — the opening chapters of Bram Stoker's Dracula. Output is scored against the committed golden reference dracula-4ch.golden.json in docs/bench/golden-reference/, which is human-curated and text-anchored.
- Chapter I · Jonathan Harker's Journal: arrival at the castle (5,701 words)
- Chapter II · Harker's Journal continued: the Count and the crucifix (5,642 words)
- Chapter III · Harker's Journal continued: locked in, the three women (5,729 words)
- Chapter IV · Harker's Journal continued: escape attempt; Mina enters (5,895 words)
- Total: 22,967 words
Hallucination resistance (weight 30% · quote authenticity)
What we test: for every quote the model attributes to a character, the scorer searches the source text for that exact string. The score is the share of attributed quotes that were found.
What we expect: ≥ 90 % verified. Inventing a quote that doesn't exist is the existential failure mode — see docs/technical/anti-hallucination-stage-1.md.
Score bands: 90–100 = ≥ 90 % of attributed quotes verified in source; 70–89 = amber; below 70 = red.
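The quote check described above can be sketched in a few lines. This is a minimal illustration, not the scorer's actual code: the function name `quote_verification_score`, the plain substring match, and the treatment of an empty quote list are all assumptions, and the sample strings are toy data.

```python
def quote_verification_score(attributed_quotes, source_text):
    """Return the percentage of attributed quotes found verbatim in source_text."""
    if not attributed_quotes:
        # Assumption: a model that attributes no quotes has invented none.
        return 100.0
    # Exact-string search, as the rubric describes; no normalisation is applied.
    verified = sum(1 for quote in attributed_quotes if quote in source_text)
    return 100.0 * verified / len(attributed_quotes)


# Toy data for illustration only.
source = "I am Dracula; and I bid you welcome, Mr. Harker, to my house."
quotes = [
    "I bid you welcome, Mr. Harker",   # present in the source string
    "Welcome to my castle, Jonathan",  # invented, not in the source string
]
score = quote_verification_score(quotes, source)
```

With one of the two quotes verified, `score` is 50.0, which would land in the red band. A real scorer may need to normalise whitespace or quotation marks before matching; that choice is not specified here.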
Core correctness (weight 30% · must-have gate)
What we test: a fixed list of must-have facts curated from the gold. Each is a downstream-load-bearing claim — something a chat feature later assumes to be true. The score is the pass rate.
What we expect: 100 %. A single must-have failure flags the model regardless of overall score, because that one failure silently breaks a downstream feature in production.
Each column is a model's must-have pass rate (0–100). Anything below 100 indicates at least one downstream-breaking failure.
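As a rough sketch of how the gate behaves, the pass rate and the flag can be derived from a map of must-have checks. The function name `must_have_gate` and the fact identifiers below are hypothetical; the real must-have list lives with the golden reference.

```python
def must_have_gate(results):
    """results maps a must-have fact id to True (passed) or False (failed)."""
    if not results:
        raise ValueError("the must-have list must not be empty")
    passed = sum(1 for ok in results.values() if ok)
    pass_rate = 100.0 * passed / len(results)
    # A single failure flags the model, regardless of the other rubrics.
    verdict = "PASS" if passed == len(results) else "FLAG"
    return pass_rate, verdict


# Hypothetical fact ids, for illustration only.
checks = {
    "fact_1": True,
    "fact_2": True,
    "fact_3": True,
    "fact_4": True,
    "fact_5": False,
}
pass_rate, verdict = must_have_gate(checks)
```

Four of five passing gives a pass rate of 80.0 and a `FLAG` verdict, the same shape as a 4/5 row in the overview table.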
Coverage (weight 20% · chapter occurrence)
What we test: for each gold character, did the model correctly identify the chapters they appear in? This is recall — not whether the model named the right characters, but whether it saw all of their on-page appearances.
What we expect: ≥ 90 % of (character × chapter) gold-occurrences matched.
Score 90–100 = ≥ 90 % of gold character/chapter occurrences matched.
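Coverage as described is recall over (character × chapter) pairs. The sketch below assumes both the gold file and the model output can be reduced to a mapping from character name to chapter numbers; the function name `coverage_score` and the placeholder data are illustrative and not taken from dracula-4ch.golden.json.

```python
def coverage_score(gold, predicted):
    """Percentage of gold (character, chapter) pairs that the model also reported."""
    gold_pairs = {(name, ch) for name, chapters in gold.items() for ch in chapters}
    pred_pairs = {(name, ch) for name, chapters in predicted.items() for ch in chapters}
    if not gold_pairs:
        raise ValueError("gold reference has no occurrences")
    # Recall only: extra predicted pairs do not lower this score.
    matched = gold_pairs & pred_pairs
    return 100.0 * len(matched) / len(gold_pairs)


# Placeholder characters and chapters, for illustration only.
gold = {"char_a": [1, 2, 3, 4], "char_b": [2, 3]}
predicted = {"char_a": [1, 2, 3], "char_b": [2, 3], "char_c": [1]}
score = coverage_score(gold, predicted)
```

Here five of the six gold occurrences are matched, so `score` is about 83.3. The extra `char_c` entry does not affect coverage; spurious characters are the business of the cleanliness rubric.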
Cleanliness (weight 20% · discriminators)
What we test: the model output is checked for three discriminator issues — training-data leaks (characters not in the workspace), mentioned-only-as-present (characters that only appear by reference, but listed as on-page), and duplicate character entries (e.g. "The Count" and "Count Dracula" both extracted).
What we expect: zero of each. Each issue breaks a different downstream assumption — character filters, chat-context loading, and dashboard counts respectively.
Score 90–100 = no discriminator issues; below 70 = multiple leaks or duplicates.
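The three discriminator checks can be illustrated with a small detector. Everything here is an assumption made for the sketch: the function name `find_discriminator_issues`, the input shapes (a list of extracted entries with an `on_page` flag, a gold map of name to on-page status, and an alias table), and the toy data. How the findings are converted into a 0–100 score is not covered.

```python
from collections import Counter


def find_discriminator_issues(extracted, gold_on_page, aliases):
    """Return the three kinds of discriminator findings as sorted name lists."""
    # Resolve aliases such as "The Count" to one canonical name.
    canonical = [aliases.get(entry["name"], entry["name"]) for entry in extracted]

    # Training-data leak: a character that is not in the gold reference at all.
    leaks = sorted({name for name in canonical if name not in gold_on_page})

    # Mentioned-only-as-present: gold says reference-only, model says on-page.
    mentioned_only = sorted({
        name
        for name, entry in zip(canonical, extracted)
        if name in gold_on_page and entry["on_page"] and not gold_on_page[name]
    })

    # Duplicate entries: two or more extracted entries resolve to the same name.
    counts = Counter(canonical)
    duplicates = sorted(name for name, n in counts.items() if n > 1)

    return {
        "leaks": leaks,
        "mentioned_only_as_present": mentioned_only,
        "duplicates": duplicates,
    }


# Toy data for illustration only; not the committed golden reference.
gold_on_page = {"Count Dracula": True, "Jonathan Harker": True, "Mr. Hawkins": False}
aliases = {"The Count": "Count Dracula"}
extracted = [
    {"name": "Count Dracula", "on_page": True},
    {"name": "The Count", "on_page": True},
    {"name": "Mr. Hawkins", "on_page": True},
    {"name": "Van Helsing", "on_page": True},
]
issues = find_discriminator_issues(extracted, gold_on_page, aliases)
```

In this toy run each check fires once: `Van Helsing` is a leak, `Mr. Hawkins` is listed as present although the toy gold marks him as mentioned only, and `Count Dracula` is duplicated via its alias.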
Trust quadrant (upper-right is better)
What we plot: the two truth dimensions against each other. X = Hallucination resistance (did the model avoid inventing?), Y = Coverage (did the model see everything that was actually there?). Both matter independently: a model that invents nothing but misses half the gold characters is just as broken as one that is complete but lies about it.
Dot colour reflects the must-have gate — green = all passed, amber = one failure, red = multiple failures.
X = Hallucination resistance, Y = Coverage. Shaded upper-right rectangle = both rubrics ≥ 90 %. Bottom-right = cautious but incomplete; top-left = thorough but invents; bottom-left = both broken.
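The quadrant and dot-colour rules can be written out as two small functions. This is a sketch under assumptions: the 90 % line is taken as the boundary for all four quadrants (the text only states it for the shaded rectangle), and the names `trust_quadrant`, `dot_colour`, and the label `"trusted"` are hypothetical.

```python
def trust_quadrant(hallucination_resistance, coverage, threshold=90.0):
    """Place a model in one of the four quadrants of the trust plot."""
    avoids_inventing = hallucination_resistance >= threshold  # X axis
    sees_everything = coverage >= threshold                   # Y axis
    if avoids_inventing and sees_everything:
        return "trusted"                  # shaded upper-right rectangle
    if avoids_inventing:
        return "cautious but incomplete"  # bottom-right
    if sees_everything:
        return "thorough but invents"     # top-left
    return "both broken"                  # bottom-left


def dot_colour(must_have_failures):
    """Green = all must-haves passed, amber = one failure, red = several."""
    if must_have_failures == 0:
        return "green"
    if must_have_failures == 1:
        return "amber"
    return "red"


# Scores for one overview row: Halluc 100, Cover 73, Must 4/5.
quadrant = trust_quadrant(100, 73)
colour = dot_colour(5 - 4)
```

For a row with Halluc 100, Cover 73 and 4/5 must-haves, this gives the bottom-right quadrant ("cautious but incomplete") with an amber dot.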
Overview
All rubrics + must-have gate + overall percent + verdict, side-by-side.
| model | Halluc | Core | Cover | Clean | Must | Overall | Verdict |
|---|---|---|---|---|---|---|---|
| gemini-2.5-flash (Google Gemini) | 100 | 80 | 73 | 100 | 4/5 | 89 | FLAG |
| google/gemma-4-e4b (LM Studio) | 100 | 100 | 73 | 70 | 5/5 | 89 | PASS |
| gpt-4.1-mini (OpenAI) | 100 | 100 | 82 | 100 | 5/5 | 96 | PASS |
Overall ranking
Each model has a detailed per-model report (per-rubric breakdown, must-have gate detail, character counts, and discriminator findings). The overall score is weighted: Hallucination 30% · Core 30% · Coverage 20% · Cleanliness 20%. The verdict reflects the must-have gate independently of the overall score: a model can be high-scoring and still flagged if a downstream-critical fact is wrong.
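The weighting can be checked against the overview table with a short calculation. The dictionary keys and the function name `overall_score` are illustrative, and rounding to the nearest whole percent is an assumption about how the Overall column is displayed.

```python
# Rubric weights as stated: Hallucination 30%, Core 30%, Coverage 20%, Cleanliness 20%.
WEIGHTS = {"halluc": 0.30, "core": 0.30, "cover": 0.20, "clean": 0.20}

# Rubric scores copied from the overview table.
RUBRICS = {
    "gemini-2.5-flash": {"halluc": 100, "core": 80, "cover": 73, "clean": 100},
    "google/gemma-4-e4b": {"halluc": 100, "core": 100, "cover": 73, "clean": 70},
    "gpt-4.1-mini": {"halluc": 100, "core": 100, "cover": 82, "clean": 100},
}


def overall_score(rubric_scores):
    """Weighted sum of the four rubric scores, rounded to a whole percent."""
    weighted = sum(WEIGHTS[name] * rubric_scores[name] for name in WEIGHTS)
    return round(weighted)  # assumption: the report rounds to the nearest integer


overall = {model: overall_score(scores) for model, scores in RUBRICS.items()}
```

This reproduces the Overall column: 88.6 rounds to 89 for both gemini-2.5-flash and google/gemma-4-e4b, and 96.4 rounds to 96 for gpt-4.1-mini. The two 89s carry different verdicts because the verdict is driven by the must-have gate rather than the weighted sum.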
Notes
One model flagged: gemini-2.5-flash did not pass the must-have gate (4 of 5 must-have facts passed). See the per-model report for the specific failure.
google/gemma-4-e4b ran locally via LM Studio (always free, fully private — no data leaves the device; speed depends on local hardware: MacBook Air (M3)).