Deep Analysis Bench — Model Comparison

3 models · fixture dracula-4ch-bench · gold dracula-4ch.golden.json · 2026-05-30T09:24:55.345Z

Why this benchmark exists

When a writer imports a book into TalePal, we run an analysis pass that extracts characters, plot events, and chapter summaries. Every later chat turn loads those outputs as context. If the analyzer invents a character that isn't in the manuscript, or attributes a quote that nobody actually said, every downstream feature inherits the lie — and the writer is the last to notice.

This benchmark measures analytical correctness: each model runs the analyzer (steps 1–5) against a real book, and the output is scored against a committed golden reference — a text-anchored truth file in the repo. The four rubrics below combine into an overall score, but the must-have gate is the real pass/fail: a model can be high-scoring overall and still flagged if one of the load-bearing facts is wrong.

For chat-side measurements (time-to-first-token, $/turn, hallucination tells in conversational replies), see the Fast Pingpong Bench.

The contestants

3 models analyzed the same book with the same pipeline. Differences in pricing model and privacy posture come along for the ride.

gemini-2.5-flash · Google Gemini

Cloud API · paid

Pay-per-token from the first request. Requests leave your machine.

google/gemma-4-e4b · LM Studio · LOCAL

Local LLM

Always free, fully private: runs on your own machine, no data leaves the device. Speed depends on local hardware.

This run: MacBook Air (M3)

gpt-4.1-mini · OpenAI

Cloud API · paid

Pay-per-token from the first request. Requests leave your machine.

Test corpus

The deep bench runs the analyzer against the dracula-4ch-bench fixture, the opening four chapters of Bram Stoker's Dracula. Output is scored against the committed golden reference dracula-4ch.golden.json in docs/bench/golden-reference/, which is human-curated and text-anchored.

Workspace under test

Bram Stoker — Dracula, Chapters I–IV · ~22,967 words. The four chapters together cover the Castle Dracula arc — arrival, the locked rooms, the three women, and the escape attempt that opens the door to Mina. Small enough to run an honest analysis pass; large enough that an honest model has to actually read.

Chapters in the fixture

Chapter	Contents	Words
Chapter I	Jonathan Harker's Journal: arrival at the castle	5,701
Chapter II	Harker's Journal continued: the Count and the crucifix	5,642
Chapter III	Harker's Journal continued: locked in, the three women	5,729
Chapter IV	Harker's Journal continued: escape attempt; Mina enters	5,895
Total		22,967

What the analyzer runs: step 1 (split into chapters) → step 2 (per-chapter summaries) → step 3 (character extraction) → step 4 (character enhancement) → step 5 (plot events). Every output is then scored against the gold.

Hallucination resistance · weight 30% · quote authenticity

What we test: for every quote the model attributes to a character, the scorer searches the source text for that exact string. The score is the share of attributed quotes that were found.

What we expect: ≥ 90 % verified. Inventing a quote that doesn't exist is the existential failure mode — see docs/technical/anti-hallucination-stage-1.md.

Example

A model attributes to Harker: "My friend, I tremble even now to think of it" in Chapter II. The scorer fuzz-searches Chapter II for that string. Not found → counted as a hallucinated quote.

Score 90–100 = ≥ 90 % of attributed quotes verified in source; 70–89 = amber; below 70 = red.
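The quote check above can be sketched roughly as follows. This is a minimal illustration, not the scorer itself: the real matching rules are not given here (the text mentions both an exact search and a fuzzy one), so the sketch assumes a normalized substring match. The names normalize, quote_verification_rate and band are hypothetical, as is the choice to score an empty quote list as 100.

```python
import re
import unicodedata

# Assumption: curly quotes in model output should match straight quotes in the source.
_QUOTE_FOLD = str.maketrans({"\u201c": '"', "\u201d": '"', "\u2018": "'", "\u2019": "'"})


def normalize(text: str) -> str:
    """Lowercase, fold curly quotes to straight ones, collapse whitespace."""
    text = unicodedata.normalize("NFKC", text).translate(_QUOTE_FOLD)
    return re.sub(r"\s+", " ", text).strip().lower()


def quote_verification_rate(quotes: list[str], source_text: str) -> float:
    """Share (0-100) of attributed quotes that occur in the source text."""
    if not quotes:
        return 100.0  # assumption: nothing attributed means nothing invented
    haystack = normalize(source_text)
    verified = sum(1 for quote in quotes if normalize(quote) in haystack)
    return 100.0 * verified / len(quotes)


def band(score: float) -> str:
    """Colour band for a rubric score; 90 itself is assumed to count as green."""
    if score >= 90:
        return "green"
    if score >= 70:
        return "amber"
    return "red"


if __name__ == "__main__":
    # Toy source text, not taken from the fixture.
    source = "He bowed and said:   \u201cCome in, traveller.\u201d Then the door closed."
    attributed = ['"Come in, traveller."', "I never said this at all"]
    rate = quote_verification_rate(attributed, source)
    print(rate, band(rate))
```

A substring match is the strictest reasonable reading; a scorer that tolerates small transcription differences would swap the `in` test for an edit-distance comparison.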

Core correctness · weight 30% · must-have gate

What we test: a fixed list of must-have facts curated from the gold. Each is a downstream-load-bearing claim — something a chat feature later assumes to be true. The score is the pass rate.

What we expect: 100 %. A single must-have failure flags the model regardless of overall score, because that one failure silently breaks a downstream feature in production.

Example must-have

"Mina is present in Chapter IV" — Mina enters the workspace in this chapter, and downstream chat features will load her as a character whose state matters from here on. A model that lists her only as mentioned-only fails this must-have.

Each column is a model's must-have pass rate (0–100). Anything below 100 means at least one downstream-breaking failure.
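As a rough sketch of the gate logic described above: the pass rate is the share of must-haves that hold, and any single failure flags the model. The MustHave type and must_have_gate function are hypothetical names for illustration; how the real scorer evaluates each fact is not shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MustHave:
    description: str  # the load-bearing claim, e.g. "Mina is present in Chapter IV"
    passed: bool      # whether the model's output satisfied it


def must_have_gate(results: list[MustHave]) -> tuple[float, str]:
    """Return (pass rate 0-100, verdict). A single failure flags the model."""
    if not results:
        raise ValueError("the gate needs at least one must-have")
    passed = sum(1 for result in results if result.passed)
    rate = 100.0 * passed / len(results)
    verdict = "PASS" if passed == len(results) else "FLAG"
    return rate, verdict


if __name__ == "__main__":
    # Placeholder facts; only the Mina example comes from the text above.
    results = [
        MustHave("Mina is present in Chapter IV", False),
        MustHave("placeholder fact 2", True),
        MustHave("placeholder fact 3", True),
        MustHave("placeholder fact 4", True),
        MustHave("placeholder fact 5", True),
    ]
    print(must_have_gate(results))
```

With five must-haves, one failure yields a pass rate of 80 and a FLAG verdict, which is why the gate is reported separately from the weighted score.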

Coverage · weight 20% · chapter occurrence

What we test: for each gold character, did the model correctly identify the chapters they appear in? This is recall — not whether the model named the right characters, but whether it saw all of their on-page appearances.

What we expect: ≥ 90 % of (character × chapter) gold-occurrences matched.

Example

Gold says Harker appears in Chapters I, II, III, IV. A model that places Harker only in I and II scores 2/4 = 50 % for that one character. The rubric score averages this across every gold character.

Score 90–100 = ≥ 90 % of gold character/chapter occurrences matched.
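The recall calculation in the example can be sketched as below. It assumes the per-character averaging described in the example (each gold character's recall is computed first, then averaged); pooling all character/chapter pairs before dividing would give a different number when characters have unequal chapter counts. The function name and the dict-of-sets input shape are hypothetical.

```python
def coverage_score(gold: dict[str, set[int]], predicted: dict[str, set[int]]) -> float:
    """Mean per-character chapter recall, on a 0-100 scale."""
    recalls = []
    for name, gold_chapters in gold.items():
        if not gold_chapters:
            continue  # a gold character with no chapters contributes nothing
        found = predicted.get(name, set()) & gold_chapters
        recalls.append(len(found) / len(gold_chapters))
    if not recalls:
        raise ValueError("gold must contain at least one character occurrence")
    return 100.0 * sum(recalls) / len(recalls)


if __name__ == "__main__":
    # Harker's chapters follow the example above; "Character B" is a placeholder.
    gold = {"Harker": {1, 2, 3, 4}, "Character B": {2, 3}}
    predicted = {"Harker": {1, 2}, "Character B": {2, 3}}
    print(coverage_score(gold, predicted))  # Harker 2/4, Character B 2/2
```

Chapters the model lists that are not in gold are ignored here, since this rubric measures recall only.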

Cleanliness · weight 20% · discriminators

What we test: the model output is checked for three discriminator issues — training-data leaks (characters not in the workspace), mentioned-only-as-present (characters that only appear by reference, but listed as on-page), and duplicate character entries (e.g. "The Count" and "Count Dracula" both extracted).

What we expect: zero of each. Each issue breaks a different downstream assumption — character filters, chat-context loading, and dashboard counts respectively.

Example training-data leak

A model extracts Renfield as a character. Renfield doesn't appear in Chapters I–IV — the model is filling in from prior training-data knowledge of Dracula. The downstream chat feature would then offer the writer a character that isn't in their workspace.

Score 90–100 = no discriminator issues; below 70 = multiple leaks or duplicates.
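The three discriminator checks could look roughly like this. It is a sketch under stated assumptions: gold is taken to map each canonical character name to "present" or "mentioned-only", and an alias table maps variant names to the canonical one. ExtractedCharacter, find_issues and both input shapes are hypothetical, and the mapping from issue counts to a 0–100 score is left out because it is not specified here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtractedCharacter:
    name: str
    status: str  # "present" or "mentioned-only", as reported by the model


def find_issues(
    extracted: list[ExtractedCharacter],
    gold_status: dict[str, str],
    aliases: dict[str, str],
) -> dict[str, list[str]]:
    """Group extracted names by the discriminator issue they trigger."""
    issues: dict[str, list[str]] = {
        "leaks": [],
        "mentioned_as_present": [],
        "duplicates": [],
    }
    seen: set[str] = set()
    for character in extracted:
        canonical = aliases.get(character.name, character.name)
        if canonical not in gold_status:
            issues["leaks"].append(character.name)  # not in the workspace at all
            continue
        if canonical in seen:
            issues["duplicates"].append(character.name)  # same person, second entry
        seen.add(canonical)
        if gold_status[canonical] == "mentioned-only" and character.status == "present":
            issues["mentioned_as_present"].append(character.name)
    return issues


if __name__ == "__main__":
    # "Character X" is a placeholder for a character gold marks as mentioned-only.
    gold_status = {"Count Dracula": "present", "Character X": "mentioned-only"}
    aliases = {"The Count": "Count Dracula"}
    extracted = [
        ExtractedCharacter("Count Dracula", "present"),
        ExtractedCharacter("The Count", "present"),
        ExtractedCharacter("Renfield", "present"),
        ExtractedCharacter("Character X", "present"),
    ]
    print(find_issues(extracted, gold_status, aliases))
```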

Trust quadrant · upper-right is better

What we plot: the two truth dimensions on one axis pair. X = Hallucination resistance (did the model avoid inventing?), Y = Coverage (did the model see everything that was actually there?). Both matter independently — a model that invents nothing but misses half the gold characters is just as broken as one that's complete but lies about it.

Dot colour reflects the must-have gate — green = all passed, amber = one failure, red = multiple failures.

X = Hallucination resistance, Y = Coverage. Shaded upper-right rectangle = both rubrics ≥ 90 %. Bottom-right = cautious but incomplete; top-left = thorough but invents; bottom-left = both broken.
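A small sketch of how a model could be placed on this plot. Only the shaded rectangle (both rubrics ≥ 90 %) is defined above; using the same 90 threshold to split the other three quadrants is an assumption, and trust_quadrant and gate_colour are hypothetical names.

```python
def trust_quadrant(hallucination_resistance: float, coverage: float,
                   threshold: float = 90.0) -> str:
    """Name the quadrant for a point (X = hallucination resistance, Y = coverage)."""
    resists = hallucination_resistance >= threshold
    covers = coverage >= threshold
    if resists and covers:
        return "upper-right: both rubrics pass"
    if resists:
        return "bottom-right: cautious but incomplete"
    if covers:
        return "top-left: thorough but invents"
    return "bottom-left: both broken"


def gate_colour(must_have_failures: int) -> str:
    """Dot colour from the number of failed must-haves."""
    if must_have_failures < 0:
        raise ValueError("failure count cannot be negative")
    if must_have_failures == 0:
        return "green"
    if must_have_failures == 1:
        return "amber"
    return "red"


if __name__ == "__main__":
    # A model with hallucination resistance 100, coverage 73 and one failed must-have.
    print(trust_quadrant(100, 73), "/", gate_colour(1))
```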

Overview

All rubrics + must-have gate + overall percent + verdict, side-by-side.

Model	Provider	Hallucination	Core	Coverage	Cleanliness	Must-have	Overall	Verdict
gemini-2.5-flash	Google Gemini	100	80	73	100	4/5	89	FLAG
google/gemma-4-e4b	LM Studio	100	100	73	70	5/5	89	PASS
gpt-4.1-mini	OpenAI	100	100	82	100	5/5	96	PASS
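The overall column is consistent with a weighted sum of the four rubric scores using the stated weights (Hallucination 30%, Core 30%, Coverage 20%, Cleanliness 20%). The sketch below reproduces it from the table rows; rounding to the nearest integer is an assumption, and overall_score is a hypothetical name.

```python
# Weights in whole percent, to keep the arithmetic exact until the final division.
WEIGHTS = {"hallucination": 30, "core": 30, "coverage": 20, "cleanliness": 20}


def overall_score(rubrics: dict[str, int]) -> int:
    """Weighted overall score, rounded to the nearest integer (assumed rounding)."""
    weighted = sum(WEIGHTS[name] * rubrics[name] for name in WEIGHTS)
    return round(weighted / 100)


if __name__ == "__main__":
    table = {
        "gemini-2.5-flash": {"hallucination": 100, "core": 80, "coverage": 73, "cleanliness": 100},
        "google/gemma-4-e4b": {"hallucination": 100, "core": 100, "coverage": 73, "cleanliness": 70},
        "gpt-4.1-mini": {"hallucination": 100, "core": 100, "coverage": 82, "cleanliness": 100},
    }
    for model, rubrics in table.items():
        print(model, overall_score(rubrics))
```

For gemini-2.5-flash the weighted sum is 30 + 24 + 14.6 + 20 = 88.6, and for google/gemma-4-e4b it is 30 + 30 + 14.6 + 14 = 88.6, so both round to 89; the verdict differs only because of the must-have gate.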

Overall ranking

Each model has a detailed per-model report (per-rubric breakdown, must-have gate detail, character counts, and discriminator findings). The overall score is weighted: Hallucination 30% · Core 30% · Coverage 20% · Cleanliness 20%. The verdict reflects the must-have gate independently of the overall score: a model can score high and still be flagged if a downstream-critical fact is wrong.

🥇 gpt-4.1-mini (openai) · CLOUD · PASS · 96/100

🥈 gemini-2.5-flash (gemini) · CLOUD · FLAG · 89/100

🥉 google/gemma-4-e4b (lmstudio) · LOCAL · MacBook Air (M3) · PASS · 89/100

Notes

1 model flagged: gemini-2.5-flash passed only 4 of 5 must-haves, so it fails the must-have gate despite an overall score of 89. See the per-model report for the specific failure.

google/gemma-4-e4b ran locally via LM Studio (always free, fully private — no data leaves the device; speed depends on local hardware: MacBook Air (M3)).