TalePal
Local‑first analysis · quality & speed

TalePal LLM Model Benchmark

Fixture: Dracula, first 4 chapters
Scored vs: hand‑built golden reference
Goal: quality & speed

Private, on‑device analysis — at cloud quality

The cloud frontier still leads on raw quality: gpt-4.1-mini tops this edition at 96%. But the real story is local. Running fully on‑device, Gemma‑4 E4B (only about 4 billion effective parameters) scores 89%, level with gemini-2.5-flash and 7 points behind gpt-4.1-mini, and it passes every must‑have gate. That is privacy with no quality penalty against Gemini. The one cost is speed: roughly 7× slower than Gemini (~33 s vs ~4.8 s per turn on an M3 MacBook Air). That is tolerable, because analysis runs in the background and you keep writing while it works.

Quality — local
89%

Gemma‑4 E4B on‑device, all 5 must‑haves passed — vs 96% for the cloud frontier.

Speed — the trade‑off
~7×

slower than Gemini (~33 s vs ~4.8 s/turn). But it runs in the background.

Privacy
on‑device

Nothing leaves your machine. No API keys, no upload — the whole novel stays local.

Bench results

Quality & speed — May 30, 2026

3 models · fixture dracula-4ch-bench
#  Model               Provider              Quality  Speed       Verdict  Must‑have
1  gpt-4.1-mini        openai                96%      2.7 s/turn  pass     5/5
2  gemini-2.5-flash    gemini                89%      4.8 s/turn  flag     4/5
3  google/gemma-4-e4b  lmstudio (on‑device)  89%      33 s/turn   pass     5/5
Quality comes from the deep‑analysis bench (scored vs the golden reference). Speed is the average response time per chat turn from the fast bench, measured on a MacBook Air (M3).

How we score

One fixture, one golden reference

Every model analyses the same four chapters of Dracula through TalePal's full pipeline — character extraction, role classification, plot phases, quotes. The output is then graded against a hand‑built golden reference: the answer key a careful human reader would write.

Five must‑have gates sit on top of the weighted score. Among them, the protagonist and the antagonist must be identified correctly. A model can write beautifully and still be flagged if it misses the villain. Every attributed quote is also verified against the source text, so invented lines are caught, not rewarded.
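
The quote check described above could be sketched roughly as follows. This is a minimal illustration, not TalePal's actual verifier: the function names `normalize` and `quote_in_source` are hypothetical, and the assumption is that a quote counts as verified only when it appears verbatim in the source once typographic quotes, case, and line breaks are folded away.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Fold typographic variants and whitespace so formatting does not break matching."""
    text = unicodedata.normalize("NFKC", text)
    # Map curly quotes and apostrophes to their ASCII forms.
    text = text.translate(
        str.maketrans({"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'})
    )
    # Collapse any run of whitespace (including line breaks) to a single space.
    return re.sub(r"\s+", " ", text).strip().lower()


def quote_in_source(quote: str, source: str) -> bool:
    """Return True only if the normalized quote occurs verbatim in the normalized source."""
    needle = normalize(quote)
    # An empty quote never counts as verified (assumption).
    return bool(needle) and needle in normalize(source)
```

A strict substring match like this rejects paraphrases as well as inventions; whether the real check tolerates small deviations is not stated in the source.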

Speed is measured separately, as average response time per chat turn, so local and cloud latency can be compared on the same machine.
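
As a rough sketch of that measurement, the snippet below times each chat turn and averages the results. It is an illustration under assumptions: `send_turn` is a hypothetical callable that blocks until the model's full reply has arrived, and the real fast bench may measure differently (for example, time to first token).

```python
import time
from statistics import mean
from typing import Callable, Sequence


def average_turn_seconds(send_turn: Callable[[str], str], prompts: Sequence[str]) -> float:
    """Time each chat turn with a monotonic clock and return the mean in seconds."""
    durations = []
    for prompt in prompts:
        start = time.perf_counter()
        send_turn(prompt)  # hypothetical: blocks until the full reply is received
        durations.append(time.perf_counter() - start)
    return mean(durations)


def slowdown(local_seconds: float, cloud_seconds: float) -> float:
    """How many times slower the local model is than the cloud model."""
    return local_seconds / cloud_seconds
```

With the published averages, `slowdown(33, 4.8)` gives 6.875, which is the "~7×" figure quoted for Gemma‑4 E4B against Gemini.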

Weighted rubric
  • hallucination resistance: 30%
  • core correctness: 30%
  • coverage: 20%
  • cleanliness: 20%
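
The rubric weights above can be combined along these lines. This is a sketch under assumptions: the source gives only the four weights and says the must‑have gates sit on top of the weighted score, so the subscore scale (0.0 to 1.0), the boolean gate representation, and the names `weighted_quality` and `verdict` are all hypothetical. The "pass" and "flag" labels are taken from the results table.

```python
WEIGHTS = {
    "hallucination_resistance": 0.30,
    "core_correctness": 0.30,
    "coverage": 0.20,
    "cleanliness": 0.20,
}


def weighted_quality(subscores: dict[str, float]) -> float:
    """Combine per-category subscores (each 0.0 to 1.0) into one 0.0 to 1.0 score."""
    missing = WEIGHTS.keys() - subscores.keys()
    if missing:
        raise ValueError(f"missing subscores: {sorted(missing)}")
    return sum(WEIGHTS[name] * subscores[name] for name in WEIGHTS)


def verdict(gates: dict[str, bool]) -> str:
    """One failed must-have gate flags the run, whatever the weighted score is."""
    return "pass" if all(gates.values()) else "flag"
```

Keeping the gates separate from the weighted sum means a high score cannot offset a missed villain, which fits how a model can score 89% and still be flagged.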