TalePal LLM Model Benchmark

Fixture: Dracula, first 4 chapters · Scored vs hand‑built golden reference · Goal: quality & speed

Private, on‑device analysis — at cloud quality

The cloud frontier still leads on raw quality: gpt-4.1-mini tops this edition at 96%. But the real story is local. Running fully on‑device, Gemma‑4 E4B, with only about 4 billion effective parameters, scores 89%. That matches the cloud model gemini-2.5-flash and sits 7 points behind gpt-4.1-mini, and it passes every must‑have gate. Privacy comes with only a small quality penalty. The one cost is speed: roughly 7× slower than Gemini (~33 s vs ~4.8 s per turn on a MacBook Air (M3)). That is tolerable, because analysis runs in the background and you keep writing while it works.

Quality — local

89%

Gemma‑4 E4B on‑device, all 5 must‑haves passed — vs 96% for the cloud frontier.

Speed — the trade‑off

~7×

slower than Gemini (~33 s vs ~4.8 s/turn). But it runs in the background.

Privacy

on‑device

Nothing leaves your machine. No API keys, no upload — the whole novel stays local.

Bench results

Quality & speed — May 30, 2026

3 models · fixture dracula-4ch-bench

#	Model	Provider	Quality	Speed	Verdict	Must‑have
1	gpt-4.1-mini	openai	96%	2.7 s/turn	pass	5/5
2	gemini-2.5-flash	gemini	89%	4.8 s/turn	flag	4/5
3	google/gemma-4-e4b	lmstudio (on‑device)	89%	33 s/turn	pass	5/5

Quality from the deep‑analysis bench (vs golden reference). Speed = avg response per chat turn from the fast bench, MacBook Air (M3).

How we score

One fixture, one golden reference

Every model analyses the same four chapters of Dracula through TalePal's full pipeline — character extraction, role classification, plot phases, quotes. The output is then graded against a hand‑built golden reference: the answer key a careful human reader would write.

Must‑have gates sit on top of the weighted score (the results table counts five per model). Two of them: the protagonist and antagonist must be identified correctly. A model can write beautifully and still be flagged if it misses the villain. Every attributed quote is also verified against the source text, so invented lines are caught, not rewarded.
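As a rough illustration of the quote check, the sketch below tests whether an attributed quote appears in the source text after light normalization. The function names, the normalization rules, and the sample text are all assumptions made for this example; TalePal's actual verifier may work differently.

```python
import re
import unicodedata


def normalize(text):
    """Unify curly quotes, collapse whitespace, and lowercase the text."""
    text = unicodedata.normalize("NFKC", text)
    text = text.replace("\u2018", "'").replace("\u2019", "'")
    text = text.replace("\u201c", '"').replace("\u201d", '"')
    return re.sub(r"\s+", " ", text).strip().lower()


def quote_in_source(quote, source):
    """Return True if the normalized quote is a substring of the normalized source."""
    return normalize(quote) in normalize(source)


# Placeholder source text, written for this example (not taken from the fixture).
sample_source = "The count smiled.\n\u201cCome in,\u201d he said,   and closed the door."

print(quote_in_source('"Come in," he said', sample_source))    # True
print(quote_in_source("He opened the window", sample_source))  # False
```

A plain substring match like this is strict: a paraphrased or partly invented line fails, which is the behaviour the gate describes.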

Speed is measured separately, as average response time per chat turn on the same machine, so local and cloud latency are directly comparable.
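The speed figures reduce to simple arithmetic. The sketch below averages per‑turn latencies and computes the slowdown ratio from the reported averages (~33 s and ~4.8 s per turn). The function names are hypothetical, and the per‑turn list is made up, since the raw latencies are not published here.

```python
from statistics import mean


def avg_seconds_per_turn(turn_latencies):
    """Average response time over a list of per-turn latencies, in seconds."""
    return mean(turn_latencies)


def slowdown(local_avg, cloud_avg):
    """How many times slower the local model is than the cloud model."""
    return local_avg / cloud_avg


# Made-up per-turn latencies, only to show the averaging step.
print(avg_seconds_per_turn([1.0, 2.0, 3.0]))  # 2.0

# Reported averages from the results table: 33 s/turn local, 4.8 s/turn Gemini.
print(round(slowdown(33.0, 4.8), 1))  # 6.9
```

The ratio of 6.9 is what the summary rounds to "~7×".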

Weighted rubric

hallucination resistance	30%
core correctness	30%
coverage	20%
cleanliness	20%
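One way to read the rubric is as a weighted average with the must‑have gates applied on top. The sketch below uses the four published weights. The category keys, the 0 to 1 score scale, and the rule that any failed gate turns the verdict into "flag" are assumptions for illustration, and the example inputs are invented.

```python
# Weights from the rubric; the dictionary keys are illustrative names.
WEIGHTS = {
    "hallucination_resistance": 0.30,
    "core_correctness": 0.30,
    "coverage": 0.20,
    "cleanliness": 0.20,
}


def weighted_score(category_scores):
    """Combine per-category scores (each 0.0 to 1.0) into a 0 to 100 score."""
    missing = set(WEIGHTS) - set(category_scores)
    if missing:
        raise ValueError(f"missing categories: {sorted(missing)}")
    total = sum(WEIGHTS[name] * category_scores[name] for name in WEIGHTS)
    return round(total * 100, 1)


def verdict(gates):
    """Assumed rule: 'pass' only when every must-have gate passed, else 'flag'."""
    return "pass" if all(gates.values()) else "flag"


# Invented inputs, not taken from any real bench run.
example_scores = {
    "hallucination_resistance": 1.0,
    "core_correctness": 0.9,
    "coverage": 0.8,
    "cleanliness": 0.7,
}
example_gates = {"protagonist": True, "antagonist": False}

print(weighted_score(example_scores))  # 87.0
print(verdict(example_gates))          # flag
```

Keeping the gates separate from the weighted sum means a high score cannot mask a missed antagonist.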

Past benchmarks

May 30, 2026

fixture dracula-4ch-bench · MacBook Air (M3) · 1 model

lmstudio 89%

May 30, 2026

fixture dracula-4ch-bench · MacBook Air (M3) · 3 models

gemini 89% lmstudio 89% openai 96%

May 30, 2026

fixture dracula-4ch-bench · — · 1 model

gemini 98%