TalePal Deep-Bench — 2026-05-30T11:27:31.265Z

lmstudio google/gemma-4-e4b

Pros: no hallucinated quotes · protagonist + antagonist correct · all 5 must-haves · no training-data leak

Cons: duplicate: Peter Hawkins = Mr. Peter Hawkins · mentioned-only as present: Peter Hawkins (role=supporting) · chapter-occurrence 72.7% · over-classifies main (6 vs 2)

Overall Rating

/100

Rubric breakdown: the four weighted components of the overall score

Each rubric is computed independently and weighted into the overall percentage at the top. The weights are deliberately simple and will be tuned as we learn. See the comparison page for what each rubric tests.

Hallucination resistance 30% 100

Core correctness 30% 100

Coverage 20% 73

Cleanliness 20% 70
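
A minimal sketch of how the four rubric scores above could combine into one overall number, assuming a plain weighted average with weights read as percentages. The exact formula is not stated on this page, and the rubric scores shown are rounded (Coverage 73 is listed as 72.7% elsewhere), so the result is illustrative only. `RUBRICS` and `overall_score` are hypothetical names.

```python
# Rubric name -> (weight in percent, score out of 100), copied from the breakdown above.
RUBRICS = {
    "Hallucination resistance": (30, 100),
    "Core correctness": (30, 100),
    "Coverage": (20, 73),
    "Cleanliness": (20, 70),
}


def overall_score(rubrics):
    """Weighted average of rubric scores.

    Assumption: weights are percentages that must sum to 100.
    """
    total_weight = sum(weight for weight, _ in rubrics.values())
    if total_weight != 100:
        raise ValueError(f"weights sum to {total_weight}, expected 100")
    return sum(weight * score for weight, score in rubrics.values()) / 100


print(overall_score(RUBRICS))  # 88.6 under the weighted-average assumption
```

Keeping the weights in one table makes the "tune as we learn" step a one-line change, and the sum check catches a retune that forgets to rebalance.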

Quality (text-anchored)

1/2

The two metrics whose truth is verifiable by string-matching against the source chapters: quote authenticity (did every attributed quote actually appear?) and chapter-occurrence (did the model place each character in the right chapters?).

Quote authenticity 45/45 100.0%

Chapter-occurrence 8/11 72.7%

findings

all attributed quotes verified in source
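
The two text-anchored metrics above can be sketched as plain string and set checks. This is an illustration under assumptions, not the bench's own code: the normalization rules, the function names, and the choice to score chapter-occurrence per (character, chapter) pair are all hypothetical, and the sample data is made up.

```python
import re


def normalize(text):
    """Lowercase, straighten curly quotes and collapse whitespace,
    so layout differences alone do not cause a miss."""
    text = text.replace("\u2018", "'").replace("\u2019", "'")
    text = text.replace("\u201c", '"').replace("\u201d", '"')
    return re.sub(r"\s+", " ", text).strip().lower()


def quote_authenticity(quotes, chapters):
    """Return (verified, total): quotes found verbatim in at least one chapter."""
    haystacks = [normalize(body) for body in chapters.values()]
    verified = sum(
        1 for quote in quotes if any(normalize(quote) in h for h in haystacks)
    )
    return verified, len(quotes)


def chapter_occurrence(predicted, gold):
    """Return (hits, total) over gold (character, chapter) pairs.

    Assumption: the unit of scoring is one pair; the real bench may differ.
    """
    pairs = [(name, ch) for name, chs in gold.items() for ch in chs]
    hits = sum(1 for name, ch in pairs if ch in predicted.get(name, ()))
    return hits, len(pairs)


# Made-up sample data, not taken from the source chapters.
chapters = {1: 'She said, "Do not  go tonight."', 2: "He replied, \u201cI must.\u201d"}
quotes = ["Do not go tonight.", "I must.", "I never said this."]
print(quote_authenticity(quotes, chapters))  # (2, 3)
print(chapter_occurrence({"A": [1, 2], "B": [2]}, {"A": [1, 2], "B": [1]}))  # (2, 3)
```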

Must-have gate

5/5

A fixed list of facts that must be correct because downstream chat features assume them. A single failure here flags the model regardless of overall score.

harker_protagonist 1/1 Harker primaryRole=protagonist

dracula_antagonist 1/1 Dracula primaryRole=antagonist

min_4_characters 1/1 10 characters

harker_all_chapters 1/1 Harker in chapters 1-4

three_women_supporting 1/1 three women / weird sisters extracted
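
The gate above behaves like a fixed list of boolean checks where any single failure flags the model. A rough sketch follows, assuming each extracted character is a dict with `name`, `primaryRole` and `chapters` keys. Only `primaryRole` appears on this page; the other keys, the helper names, and the exact pass conditions are assumptions.

```python
REQUIRED_CHAPTERS = {1, 2, 3, 4}  # "Harker in chapters 1-4"


def find(characters, *needles):
    """First character whose name contains any needle (case-insensitive), else None."""
    for character in characters:
        name = character["name"].lower()
        if any(needle in name for needle in needles):
            return character
    return None


def run_gate(characters):
    """Return (checks, flagged). flagged is True if any must-have fails."""
    harker = find(characters, "harker") or {}
    dracula = find(characters, "dracula") or {}
    women = find(characters, "three women", "weird sisters") or {}
    checks = {
        "harker_protagonist": harker.get("primaryRole") == "protagonist",
        "dracula_antagonist": dracula.get("primaryRole") == "antagonist",
        "min_4_characters": len(characters) >= 4,
        "harker_all_chapters": REQUIRED_CHAPTERS <= set(harker.get("chapters", [])),
        # Assumption: "supporting" is required here; the page only says "extracted".
        "three_women_supporting": women.get("primaryRole") == "supporting",
    }
    return checks, not all(checks.values())


# Made-up sample extraction for illustration.
sample = [
    {"name": "Jonathan Harker", "primaryRole": "protagonist", "chapters": [1, 2, 3, 4]},
    {"name": "Count Dracula", "primaryRole": "antagonist", "chapters": [1, 2, 3, 4]},
    {"name": "The three women", "primaryRole": "supporting", "chapters": [3, 4]},
    {"name": "Peter Hawkins", "primaryRole": "supporting", "chapters": [2]},
]
checks, flagged = run_gate(sample)
print(sum(checks.values()), "/", len(checks), "flagged:", flagged)  # 5 / 5 flagged: False
```

Returning the per-check dict alongside the flag keeps the gate independent of the weighted score, which matches the rule that one failure flags the model regardless of the overall result.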

Counts (model vs gold)

2/2

How many characters the model extracted, split by role, compared to gold. Both protagonist and antagonist roles must be detected; the counts give an at-a-glance sense of over- or under-extraction.

Protagonist detected 1/1 Harker

Antagonist detected 1/1 Dracula

            model  gold  Δ
characters  10     4     +6
main        6      2     +4
supporting  4      2     +2

findings

mentioned-only as present: Peter Hawkins (role=supporting)

duplicate: Peter Hawkins = Mr. Peter Hawkins
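
A duplicate finding such as "Peter Hawkins = Mr. Peter Hawkins" could be detected by comparing names after stripping honorifics. The sketch below is one possible approach, not necessarily what the bench does; the `HONORIFICS` list and both function names are hypothetical.

```python
import re
from itertools import combinations

# Assumed list of titles to ignore when comparing names; extend as needed.
HONORIFICS = {"mr", "mrs", "miss", "ms", "dr", "sir", "lady", "count", "professor"}


def canonical(name):
    """Lowercase the name, drop punctuation and honorific tokens."""
    tokens = re.findall(r"[a-z]+", name.lower())
    return " ".join(token for token in tokens if token not in HONORIFICS)


def find_duplicates(names):
    """Return pairs of distinct entries that share a non-empty canonical form."""
    return [
        (a, b)
        for a, b in combinations(names, 2)
        if canonical(a) and canonical(a) == canonical(b)
    ]


print(find_duplicates(["Peter Hawkins", "Mr. Peter Hawkins", "Dracula"]))
# [('Peter Hawkins', 'Mr. Peter Hawkins')]
```

Exact matching on the canonical form is deliberately conservative: it will not merge "Harker" with "Jonathan Harker", so it under-reports rather than producing false duplicates.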