TalePal Model Benchmark

Deep Analysis Bench — Detail

fixture dracula-4ch-bench · 2026-05-30T09:24:55.345Z · openai

openai gpt-4.1-mini

Pros no hallucinated quotes · protagonist + antagonist correct · all 5 must-haves · no training-data leak
Cons chapter-occurrence 81.8% · over-classifies main (3 vs 2)
Overall Rating 96/100

Rubric breakdown: the four weighted components of the overall score

Each rubric is scored independently and weighted into the overall rating at the top. The weights are deliberately simple and will be tuned as we learn. See the comparison page for what each rubric tests.

Hallucination resistance 30% 100
Core correctness 30% 100
Coverage 20% 82
Cleanliness 20% 100
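
As a rough check, the weighted sum of the four rubric scores above can be reproduced in a few lines. This is a minimal sketch, not the benchmark's actual code: the dictionary keys and the `overall_score` helper are hypothetical names, and rounding to the nearest whole number is an assumption based on the displayed rating.

```python
# Weights (in percent) and scores copied from the rubric breakdown above.
# The keys and this helper are illustrative names, not the benchmark's own.
RUBRICS = {
    "hallucination_resistance": (30, 100),
    "core_correctness": (30, 100),
    "coverage": (20, 82),
    "cleanliness": (20, 100),
}

def overall_score(rubrics):
    """Weighted average of rubric scores; weights are percentages summing to 100."""
    total_weight = sum(weight for weight, _ in rubrics.values())
    if total_weight != 100:
        raise ValueError(f"weights sum to {total_weight}, expected 100")
    return sum(weight * score for weight, score in rubrics.values()) / 100

raw = overall_score(RUBRICS)
print(raw)         # 96.4
print(round(raw))  # 96 (assumes the page rounds to the nearest integer)
```

With these inputs the only rubric pulling the rating below 100 is Coverage, which contributes 16.4 of its possible 20 points.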

Quality (text-anchored)

1/2

The two metrics whose truth is verifiable by string-matching against the source chapters: quote authenticity (did every attributed quote actually appear?) and chapter-occurrence (did the model place each character in the right chapters?).

Quote authenticity 44/44 100.0%
Chapter-occurrence 9/11 81.8%
findings
all attributed quotes verified in source
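
The string-matching idea behind quote authenticity can be sketched as follows. This is an illustration under stated assumptions, not the benchmark's implementation: the whitespace and case normalization, the function names, and the placeholder chapter text are all hypothetical. How the 11 chapter-occurrence items are defined is not stated on this page, so only the percentage arithmetic is shown for that metric.

```python
import re

def normalize(text):
    """Collapse whitespace and lowercase (an assumed normalization step)."""
    return re.sub(r"\s+", " ", text).strip().lower()

def quote_authenticity(quotes, chapters):
    """Return (verified, total): quotes found as substrings of any chapter."""
    sources = [normalize(chapter) for chapter in chapters]
    verified = 0
    for quote in quotes:
        needle = normalize(quote)
        # An empty quote is never counted as verified.
        if needle and any(needle in source for source in sources):
            verified += 1
    return verified, len(quotes)

def pct(hits, total):
    """Format a ratio the way the page displays it, e.g. 9/11 -> '81.8%'."""
    return f"{100 * hits / total:.1f}%"

# Placeholder text, not taken from the fixture.
chapters = [
    "Placeholder chapter one text.\nThe door   opened slowly.",
    "Placeholder chapter two text.",
]
quotes = ["The door opened slowly.", "This line is not in the source."]

verified, total = quote_authenticity(quotes, chapters)
print(verified, total)   # 1 2
print(pct(44, 44))       # 100.0%
print(pct(9, 11))        # 81.8%
```

A stricter checker might match exact punctuation and case; the looser normalization here only avoids false misses from line wrapping.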

Must-have gate

5/5

A fixed list of facts that must be correct because downstream chat features assume them. A single failure here flags the model regardless of overall score.

harker_protagonist 1/1 Harker primaryRole=protagonist
dracula_antagonist 1/1 Dracula primaryRole=antagonist
min_4_characters 1/1 9 characters
harker_all_chapters 1/1 Harker in chapters 1-4
three_women_supporting 1/1 three women / weird sisters extracted
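
One way such a gate could be expressed is as a set of named boolean checks over the extracted characters, where any single failure flags the run. The sketch below is hypothetical: the record shape (`name`, `primaryRole`, `chapters`), the exact-name lookup, and the alias list for the three women are assumptions made for illustration, and the sample data is invented.

```python
def run_gate(characters):
    """Evaluate the five must-have checks; return (results, all_passed)."""
    by_name = {c["name"].lower(): c for c in characters}

    def role(name):
        return by_name.get(name, {}).get("primaryRole")

    harker_chapters = set(by_name.get("harker", {}).get("chapters", []))
    results = {
        "harker_protagonist": role("harker") == "protagonist",
        "dracula_antagonist": role("dracula") == "antagonist",
        "min_4_characters": len(characters) >= 4,
        "harker_all_chapters": harker_chapters >= {1, 2, 3, 4},
        # Assumed aliases; a real matcher would likely be more tolerant.
        "three_women_supporting": any(
            role(alias) == "supporting"
            for alias in ("three women", "weird sisters")
        ),
    }
    return results, all(results.values())

# Invented sample output for illustration only.
sample = [
    {"name": "Harker", "primaryRole": "protagonist", "chapters": [1, 2, 3, 4]},
    {"name": "Dracula", "primaryRole": "antagonist", "chapters": [2, 3, 4]},
    {"name": "Three women", "primaryRole": "supporting", "chapters": [3]},
    {"name": "Innkeeper", "primaryRole": "supporting", "chapters": [1]},
]

results, passed = run_gate(sample)
print(sum(results.values()), "/", len(results), "passed:", passed)
```

Keeping the gate separate from the weighted rating means a model with a high overall score can still be flagged when one of these facts is wrong.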

Counts (model vs gold)

2/2

How many characters the model extracted, split by role, compared to gold. Both protagonist and antagonist roles must be detected; the counts give an at-a-glance sense of over- or under-extraction.

Protagonist detected 1/1 Harker
Antagonist detected 1/1 Dracula
            model  gold  Δ
characters  9      4     +5
main        3      2     +1
supporting  6      2     +4
findings
no discriminator issues
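
The Δ column in the counts above is simply model minus gold per row. A minimal sketch, using the figures from this page and a hypothetical `count_deltas` helper:

```python
# Counts copied from this page; the helper name is illustrative.
MODEL = {"characters": 9, "main": 3, "supporting": 6}
GOLD = {"characters": 4, "main": 2, "supporting": 2}

def count_deltas(model, gold):
    """Signed difference per row; positive means over-extraction."""
    return {key: model[key] - gold[key] for key in gold}

deltas = count_deltas(MODEL, GOLD)
for key, delta in deltas.items():
    print(f"{key:<11}{MODEL[key]:<7}{GOLD[key]:<6}{delta:+d}")
```

Every delta is positive here, which matches the "over-classifies main (3 vs 2)" note in the Cons line.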