Deep Analysis Bench — Detail
lmstudio google/gemma-4-e4b
Rubric breakdown: the four weighted components of the overall score
Each rubric is computed independently and weighted into the overall percent at the top. Weights are deliberately simple and will be tuned as we learn. See the comparison page for what each rubric tests.
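As a rough illustration of how independently computed rubric scores could be folded into one percent, here is a minimal sketch. The function name, the rubric keys, and the equal weights are all assumptions made for the example; the actual weights used by the bench are not stated on this page.

```python
def overall_percent(scores, weights):
    """Combine per-rubric scores into a single weighted percent.

    scores:  rubric name -> (earned, possible)
    weights: rubric name -> relative weight (hypothetical values)
    """
    total_weight = sum(weights[name] for name in scores)
    weighted = sum(
        weights[name] * earned / possible
        for name, (earned, possible) in scores.items()
    )
    return 100.0 * weighted / total_weight


# Three of the rubric scores from this page, with assumed equal weights.
scores = {"quality": (1, 2), "must_have": (5, 5), "counts": (2, 2)}
weights = {"quality": 1.0, "must_have": 1.0, "counts": 1.0}

print(round(overall_percent(scores, weights), 1))
```

Normalizing by the sum of weights keeps the result on a 0 to 100 scale even when the weights are retuned and no longer sum to one.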
Quality (text-anchored)
1/2
The two metrics whose truth is verifiable by string-matching against the source chapters: quote authenticity (did every attributed quote actually appear?) and chapter-occurrence (did the model place each character in the right chapters?).
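The two text-anchored checks might look something like the sketch below. The normalization step (whitespace collapsing, quote unification, lowercasing) and the exact-set comparison for chapters are assumptions; the bench may match more strictly or more loosely. The sample chapter text is invented for the example.

```python
import re


def normalize(text):
    """Unify curly quotes, collapse whitespace, and lowercase."""
    for curly, plain in (("\u2018", "'"), ("\u2019", "'"),
                         ("\u201c", '"'), ("\u201d", '"')):
        text = text.replace(curly, plain)
    return re.sub(r"\s+", " ", text).strip().lower()


def quote_is_authentic(quote, chapters):
    """True if the normalized quote occurs in any normalized chapter."""
    needle = normalize(quote)
    # An empty quote would match everything, so reject it explicitly.
    return bool(needle) and any(needle in normalize(ch) for ch in chapters)


def chapters_match(predicted, gold):
    """True if the model placed the character in exactly the gold chapters."""
    return set(predicted) == set(gold)


chapters = ["The lamp flickered.\n\u201cStay  here,\u201d said Mara."]
print(quote_is_authentic('"Stay here,"', chapters))
print(chapters_match([1, 3], [3, 1]))
```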
Must-have gate
5/5
A fixed list of facts that must be correct because downstream chat features assume them. A single failure here flags the model regardless of overall score.
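The gate logic described above can be sketched as follows. The fact names are hypothetical placeholders, since the actual must-have list is not shown on this page.

```python
def must_have_gate(checks):
    """Summarize must-have checks; any single failure flags the model.

    checks: fact name -> True if the model got it right.
    """
    failed = [name for name, ok in checks.items() if not ok]
    return {
        "passed": len(checks) - len(failed),
        "total": len(checks),
        "flagged": bool(failed),
        "failed": failed,
    }


# Hypothetical fact names, for illustration only.
result = must_have_gate({
    "protagonist_name": True,
    "antagonist_name": True,
    "setting": True,
    "narrator": True,
    "chapter_count": True,
})
print(f"{result['passed']}/{result['total']}, flagged={result['flagged']}")
```

The flag is kept separate from the weighted score so that a model with a high overall percent can still be marked as failing the gate.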
Counts (model vs gold)
2/2
How many characters the model extracted, split by role, compared to gold. Both protagonist and antagonist roles must be detected; the counts give an at-a-glance sense of over- or under-extraction.
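A minimal sketch of the counts comparison, assuming each extracted character is a record with a `role` field and that the score awards one point per required role detected. Both the record shape and the scoring rule are assumptions, and the character names are invented.

```python
from collections import Counter

REQUIRED_ROLES = ("protagonist", "antagonist")


def role_counts(characters):
    """Count extracted characters per role."""
    return Counter(c["role"] for c in characters)


def compare_counts(model_chars, gold_chars, required=REQUIRED_ROLES):
    """Return (roles detected, roles required, per-role model-minus-gold delta)."""
    model, gold = role_counts(model_chars), role_counts(gold_chars)
    detected = sum(1 for role in required if model[role] > 0)
    # Positive delta means over-extraction, negative means under-extraction.
    deltas = {role: model[role] - gold[role]
              for role in sorted(set(model) | set(gold))}
    return detected, len(required), deltas


gold = [{"name": "Mara", "role": "protagonist"},
        {"name": "Voss", "role": "antagonist"},
        {"name": "Ilse", "role": "supporting"}]
model = [{"name": "Mara", "role": "protagonist"},
         {"name": "Voss", "role": "antagonist"}]

print(compare_counts(model, gold))
```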