TPS.SH
168 Tests Complete
12 Models · 168 Tests · 21 Prompts · $3.95 Total Cost · M2 Max Hardware
Benchmark Results
All 12 models compared across speed, quality, and cost. 9 local Ollama models vs 3 Claude API models on Apple M2 Max with 32GB unified memory.
Model               Type    TPS (tok/s)  TTFT     Avg Time  Quality (/10)  Cost
Claude Haiku 4.5    Cloud   167.8        0.7s     16.8s     8.35           $0.27
Claude Sonnet 4.6   Cloud   78.9         1.5s     39.9s     8.42           $0.94
Claude Opus 4.6     Cloud   74.7         1.7s     41.5s     8.61           $1.49
👑 qwen3-coder      Local   48.8         1.1s     37.8s     7.48           Free
gemma4:26b          Local   39.2         24.5s    65.9s     8.36           Free
phi4:14b            Local   17.9         1.5s     56.5s     7.21           Free
qwen2.5-coder:14b   Local   15.6         1.5s     68.2s     6.64           Free
deepseek-r1:14b     Local   14.6         70.2s    137.0s    5.89           Free
glm-4.7-flash       Local   10.2         54.8s    229.5s    5.30           Free
qwen2.5-coder:32b   Local   7.89         3.4s     135.7s    7.24           Free
🏆 gemma4:31b       Local   7.71         108.9s   308.6s    8.87           Free
qwen3:32b           Local   6.92         171.9s   497.1s    5.76           Free
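The TPS and TTFT columns are typically derived from streaming timestamps. A minimal sketch of the usual definitions (field names here are illustrative, not necessarily the benchmark's exact internals):

```python
# Sketch of how TPS and TTFT are commonly computed from a streamed
# response. TTFT measures time to the first token; TPS counts only the
# decode phase, so slow prompt processing does not deflate it.

def stream_metrics(t_start: float, t_first_token: float, t_end: float,
                   completion_tokens: int) -> dict:
    """Return time-to-first-token (s) and tokens-per-second for one request."""
    ttft = t_first_token - t_start
    decode_time = t_end - t_first_token
    tps = completion_tokens / decode_time if decode_time > 0 else 0.0
    return {"ttft_s": round(ttft, 1), "tps": round(tps, 1)}
```

For example, a 16.8s request that streams its first token at 0.7s and emits 2,700 completion tokens works out to roughly 168 tok/s.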

⚡ Fastest Model

Claude Haiku 4.5 at 167.8 tok/s — 3.4x faster than the best local model. Best value cloud option at $0.27.

🏆 Highest Quality

gemma4:31b scores 8.87/10 — the highest quality of any model, local or cloud. A 31B local model outscoring Claude Opus (8.61) is the Phase 3 headline.

👑 Best Local Speed

qwen3-coder at 48.8 tok/s with 7.48 quality. MoE architecture shines on M2 Max. 100% offline, zero cost, air-gapped.
Benchmark Pipeline
Four-stage pipeline: load prompts, execute against models, judge quality, generate reports.
1. Prompt Bank: 21 YAML prompts across 7 coding categories. Each defines task, expected behavior, and evaluation criteria. (21 prompts · 7 categories)

2. Runner: Executes prompts against Ollama (local) and the Anthropic API (cloud). Captures TPS, TTFT, tokens, cost, and hardware metrics. (Ollama + Anthropic adapters)

3. Judge: Claude Sonnet 4.6 scores each output on correctness (40%), completeness (35%), and clarity (25%). Bias-flagged for self-evaluation. (weighted scoring · bias flags)

4. Reports: Word docs, PowerPoint decks, React dashboard, and interactive comparison website. Full export pipeline. (docx + pptx + React + web)
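The judge stage's weighted rubric can be sketched in a few lines. A minimal illustration, assuming each criterion is scored 0-10 (the actual judging prompt and aggregation code may differ):

```python
# Weighted rubric described above: correctness 40%, completeness 35%,
# clarity 25%. Each criterion is assumed to be scored on a 0-10 scale,
# so the combined quality score is also 0-10.

WEIGHTS = {"correctness": 0.40, "completeness": 0.35, "clarity": 0.25}

def weighted_quality(scores: dict) -> float:
    """Collapse per-criterion judge scores into one 0-10 quality score."""
    return round(sum(scores[k] * w for k, w in WEIGHTS.items()), 2)
```

A model scoring 9 on correctness, 8 on completeness, and 9 on clarity lands at 8.65, comparable to the top quality scores in the table above.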
7 Coding Task Types
Each category contains 3 prompts of varying complexity, testing different aspects of code intelligence.
$ cat code_generation.yaml
Code Generation · 3 prompts · Write new code from spec

$ cat debugging_reasoning.yaml
Debugging & Reasoning · 3 prompts · Find and fix bugs

$ cat refactoring.yaml
Refactoring · 3 prompts · Improve existing code

$ cat explanation_teaching.yaml
Explanation & Teaching · 3 prompts · Explain concepts clearly

$ cat short_quick.yaml
Short Quick Tasks · 3 prompts · Fast utility tasks

$ cat long_complex.yaml
Long Complex Research · 3 prompts · Deep architecture tasks

$ cat tool_calling.yaml
Tool Calling / Agentic · 3 prompts · Agentic tool use
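A category file might look like the following sketch. The field names are illustrative assumptions, not the repo's actual schema; the source only states that each prompt defines task, expected behavior, and evaluation criteria:

```yaml
# Hypothetical prompt-bank entry -- field names are illustrative.
category: code_generation
prompts:
  - id: cg-01
    task: "Write a Python function that validates ISO-8601 dates"
    expected_behavior: "Accepts valid dates, raises ValueError on malformed input"
    evaluation_criteria:
      - correctness
      - completeness
      - clarity
```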
What 168 Tests Reveal
Phase 3 added 5 new local models and re-ran all 3 cloud models. The data tells a nuanced story about local vs cloud tradeoffs.

A local model takes the quality crown

gemma4:31b scores 8.87/10 — the highest quality of any model tested, surpassing Claude Opus (8.61) and Sonnet (8.42). A free, offline, 31B local model outperforming the best cloud models is the Phase 3 headline result.

8.87 > 8.61 quality

MoE architecture still dominates local speed

qwen3-coder (30B MoE) runs at 48.8 tok/s vs gemma4:26b at 39.2 tok/s and qwen2.5-coder:14b at 15.6 tok/s. MoE activates fewer parameters per token, leveraging M2 Max memory bandwidth more efficiently.

48.8 tok/s best local
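Why MoE wins on decode speed can be seen with a back-of-envelope roofline: if generation is memory-bandwidth bound, every active weight must be read once per token. The numbers below (400 GB/s for M2 Max, ~3B active parameters, 4-bit quantization) are illustrative assumptions, not measurements:

```python
# Roofline sketch for decode throughput, assuming generation is
# memory-bandwidth bound. All figures are illustrative assumptions.

def decode_tps_bound(bandwidth_gb_s: float, active_params_b: float,
                     bytes_per_param: float = 0.5) -> float:
    """Upper bound on tokens/sec: bandwidth / bytes of active weights.
    0.5 bytes/param approximates 4-bit quantization."""
    gb_per_token = active_params_b * bytes_per_param
    return bandwidth_gb_s / gb_per_token

dense_26b = decode_tps_bound(400, 26)  # dense: all 26B params read per token
moe_3b = decode_tps_bound(400, 3)      # MoE: only ~3B active params per token
```

Measured throughput falls well below these bounds (KV-cache traffic, compute, and runtime overhead all eat into it), but the ratio explains why a 30B MoE model can outrun a smaller dense one.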

gemma4:26b is the new quality-speed sweet spot

At 39.2 tok/s and 8.36 quality, gemma4:26b offers cloud-tier quality (above Haiku's 8.35) at local speeds with zero cost. It bridges the gap between qwen3-coder speed and Claude quality.

8.36 quality at 39.2 TPS

Claude-judging-Claude bias caveat

Cloud quality scores involve Claude models judging Claude outputs. These are flagged in all reports but may inflate cloud scores. A fair comparison would need an independent judge (GPT-4 or human reviewers).

bias flagged
Command-Line Interface
8 Typer commands covering the full benchmark lifecycle: run, judge, analyze, report, export, plus hardware and cost utilities.
llm-bench — terminal
$ python -m llm_bench run # Run benchmarks against all models
$ python -m llm_bench run --hardware-metrics # With GPU/thermal capture
$ python -m llm_bench run -m qwen3-coder -c code_generation # Filter by model and category
$ python -m llm_bench judge <run_id> # Score outputs with Claude Sonnet
$ python -m llm_bench analyze <run_id> # Generate analysis and rankings
$ python -m llm_bench report <run_id> # Generate Word + PowerPoint reports
$ python -m llm_bench export <run_id> # Export dashboard-ready JSON
$ python -m llm_bench hardware-report # Generate hardware infrastructure guide
$ python -m llm_bench cost-estimate # Estimate cost for a full run
$ python -m llm_bench list-runs # List all benchmark runs
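The arithmetic behind a command like `cost-estimate` is token counts times per-token prices. A minimal sketch, with placeholder prices (USD per million tokens, NOT actual Anthropic rates) and free local models:

```python
# Sketch of cost estimation: tokens x per-token price. The price table
# below is a placeholder assumption; local Ollama models cost $0.

PRICE_PER_MTOK = {                      # (input, output) USD per 1M tokens
    "claude-haiku-4-5": (1.00, 5.00),   # placeholder rates
    "claude-sonnet-4-6": (3.00, 15.00), # placeholder rates
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return estimated USD cost; models absent from the table are free."""
    in_price, out_price = PRICE_PER_MTOK.get(model, (0.0, 0.0))
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000
```

Summing this per request across a run is enough to produce the kind of whole-run total reported on this page.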
Built With
Python-first toolkit with React dashboard and interactive comparison websites.
Python + Typer: CLI framework
Ollama: Local model runtime
Anthropic API: Cloud model access
React + Recharts: Dashboard UI
Chart.js: Comparison website charts
python-docx: Word report generation
python-pptx: PowerPoint generation
Rich: Terminal output

Dive Into the Data

Interactive Chart.js visualizations, hardware analysis, and the full comparison.
