Deep Analysis Bench — Detail

fixture dracula-4ch-bench · 2026-05-30T09:24:55.345Z · openai

openai gpt-4.1-mini

Pros: no hallucinated quotes · protagonist + antagonist correct · all 5 must-haves · no training-data leak

Cons: chapter-occurrence 81.8% · over-classifies main (3 vs 2)

Overall Rating

/100

Rubric breakdown: the four weighted components of the overall score

Each rubric is computed independently and weighted into the overall score at the top of the page. The weights are deliberately simple and will be tuned as we learn. See the comparison page for what each rubric tests.

Hallucination resistance 30% 100

Core correctness 30% 100

Coverage 20% 82

Cleanliness 20% 100
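
As a rough illustration of how the four rubric rows could combine, here is a minimal sketch that assumes the overall score is a plain weighted mean of the rubric scores. The names `RUBRICS` and `overall_score` are hypothetical, and the page's actual rounding (or its use of unrounded inputs such as 81.8 for coverage) may differ.

```python
# Rubric name -> (weight in percent, score out of 100), copied from the rows above.
RUBRICS = {
    "hallucination_resistance": (30, 100),
    "core_correctness": (30, 100),
    "coverage": (20, 82),
    "cleanliness": (20, 100),
}


def overall_score(rubrics):
    """Weighted mean of rubric scores (assumed formula).

    Weights are percentages and must sum to 100, so a mistyped weight
    fails loudly instead of silently skewing the result.
    """
    total_weight = sum(weight for weight, _ in rubrics.values())
    if total_weight != 100:
        raise ValueError(f"weights sum to {total_weight}, expected 100")
    return sum(weight * score for weight, score in rubrics.values()) / 100


if __name__ == "__main__":
    print(overall_score(RUBRICS))
```

Keeping the weights as integer percentages avoids most floating-point noise and makes the "weights sum to 100" check exact.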

Quality (text-anchored)

1/2

The two metrics whose truth is verifiable by string-matching against the source chapters: quote authenticity (did every attributed quote actually appear?) and chapter-occurrence (did the model place each character in the right chapters?).

Quote authenticity 44/44 100.0%

Chapter-occurrence 9/11 81.8%

findings

all attributed quotes verified in source
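
The two text-anchored metrics can be sketched as follows. This is an illustration under stated assumptions, not the bench's actual scorer: the function names, the normalization rules, and the definition of chapter-occurrence as "gold (character, chapter) placements the model reproduced" are all hypothetical, and the sample strings are toy data rather than text from the fixture.

```python
import re


def normalize(text):
    """Lowercase, straighten curly quotes, and collapse whitespace so that
    matching does not depend on line wrapping or typography."""
    for curly, plain in (("\u201c", '"'), ("\u201d", '"'),
                         ("\u2018", "'"), ("\u2019", "'")):
        text = text.replace(curly, plain)
    return re.sub(r"\s+", " ", text).strip().lower()


def quote_authenticity(quotes, chapters):
    """Return (verified, total). A quote is verified if its normalized form
    is a substring of any normalized chapter text."""
    haystacks = [normalize(body) for body in chapters.values()]
    verified = sum(
        any(normalize(quote) in haystack for haystack in haystacks)
        for quote in quotes
    )
    return verified, len(quotes)


def chapter_occurrence(predicted, gold):
    """Return (correct, total) over gold (character, chapter) placements.
    Assumed definition: a placement is correct if the model also put that
    character in that chapter. Extra model-only characters are ignored."""
    correct = sum(
        len(chapters & predicted.get(name, set()))
        for name, chapters in gold.items()
    )
    total = sum(len(chapters) for chapters in gold.values())
    return correct, total


if __name__ == "__main__":
    toy_chapters = {
        1: "The coach arrived late.\nThe driver said nothing.",
        2: "He said, \u201cCome in,\u201d and bowed.",
    }
    toy_quotes = ["come in,", "the driver  said nothing", "never written"]
    print(quote_authenticity(toy_quotes, toy_chapters))

    toy_gold = {"A": {1, 2}, "B": {2}}
    toy_predicted = {"A": {1}, "B": {2}, "C": {1}}
    print(chapter_occurrence(toy_predicted, toy_gold))
```

Substring matching after normalization is deliberately strict: a paraphrased or stitched-together quote fails, which is the behaviour a hallucination check wants.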

Must-have gate

5/5

A fixed list of facts that must be correct because downstream chat features assume them. A single failure here flags the model regardless of overall score.

harker_protagonist 1/1 Harker primaryRole=protagonist

dracula_antagonist 1/1 Dracula primaryRole=antagonist

min_4_characters 1/1 9 characters

harker_all_chapters 1/1 Harker in chapters 1-4

three_women_supporting 1/1 three women / weird sisters extracted
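
One way to express a gate like this is a fixed set of named boolean checks, where any single failure flags the run. The sketch below reuses the check IDs listed above, but the `Character` shape, the name-matching rule, and the `run_gate` helper are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Character:
    name: str
    primary_role: str  # e.g. "protagonist", "antagonist", "supporting"
    chapters: set = field(default_factory=set)


def find(characters, needle):
    """First character whose name contains `needle` (case-insensitive), else None."""
    needle = needle.lower()
    return next((c for c in characters if needle in c.name.lower()), None)


def run_gate(characters):
    """Return (checks, flagged). `flagged` is True if any must-have fails,
    independent of whatever overall score the model earned."""
    harker = find(characters, "harker")
    dracula = find(characters, "dracula")
    women = find(characters, "three women") or find(characters, "weird sisters")
    checks = {
        "harker_protagonist": harker is not None
        and harker.primary_role == "protagonist",
        "dracula_antagonist": dracula is not None
        and dracula.primary_role == "antagonist",
        "min_4_characters": len(characters) >= 4,
        "harker_all_chapters": harker is not None
        and {1, 2, 3, 4} <= harker.chapters,
        "three_women_supporting": women is not None
        and women.primary_role == "supporting",
    }
    return checks, not all(checks.values())


if __name__ == "__main__":
    sample = [
        Character("Jonathan Harker", "protagonist", {1, 2, 3, 4}),
        Character("Count Dracula", "antagonist", {1, 2, 3, 4}),
        Character("The three women", "supporting", {3, 4}),
        Character("Mina Murray", "supporting", {1}),
    ]
    checks, flagged = run_gate(sample)
    print(checks, flagged)
```

Returning the per-check results alongside the single `flagged` bit keeps the gate easy to render as the `1/1` rows shown above.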

Counts (model vs gold)

2/2

How many characters the model extracted, split by role, compared to gold. Both protagonist and antagonist roles must be detected; the counts give an at-a-glance sense of over- or under-extraction.

Protagonist detected 1/1 Harker

Antagonist detected 1/1 Dracula

            model  gold  Δ
characters  9      4     +5
main        3      2     +1
supporting  6      2     +4

findings

no discriminator issues
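
The Δ column in the counts table is simply model minus gold per row. A small sketch, using the figures reported on this page and a hypothetical `count_deltas` helper, shows the arithmetic and the signed formatting:

```python
# Counts as reported in the table on this page.
MODEL = {"characters": 9, "main": 3, "supporting": 6}
GOLD = {"characters": 4, "main": 2, "supporting": 2}


def count_deltas(model, gold, keys=("characters", "main", "supporting")):
    """Per-row difference model - gold; positive means over-extraction."""
    return {key: model.get(key, 0) - gold.get(key, 0) for key in keys}


def format_delta(delta):
    """Render with an explicit sign, e.g. 5 -> '+5', -1 -> '-1', 0 -> '+0'."""
    return f"{delta:+d}"


if __name__ == "__main__":
    for key, delta in count_deltas(MODEL, GOLD).items():
        print(key, MODEL[key], GOLD[key], format_delta(delta))
```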