147 TESTS COMPLETE

TPS.SH

7 Models · 147 Tests · 21 Prompts · $3.95 Total Cost · M2 Max Hardware
Benchmark Results
All 7 models compared across speed, quality, and cost. 4 local Ollama models vs 3 Claude API models on Apple M2 Max with 32GB unified memory.
| Model | Type | TPS (tok/s) | TTFT | Avg Time | Quality (/10) | Cost |
| --- | --- | --- | --- | --- | --- | --- |
| Claude Haiku 4.5 | Cloud | 169.7 | 0.5s | 16.6s | 8.25 | $0.28 |
| Claude Sonnet 4.6 | Cloud | 77.7 | 1.0s | 39.9s | 8.59 | $0.93 |
| 🏆 Claude Opus 4.6 | Cloud | 76.6 | 1.8s | 40.4s | 8.65 | $1.48 |
| 👑 qwen3-coder | Local | 48.8 | 1.1s | 37.8s | 7.48 | Free |
| qwen2.5-coder:14b | Local | 15.6 | 1.5s | 68.2s | 6.64 | Free |
| deepseek-r1:14b | Local | 14.6 | 70.2s | 137.0s | 5.89 | Free |
| glm-4.7-flash | Local | 10.2 | 54.8s | 229.5s | 5.30 | Free |

Fastest Model

Claude Haiku 4.5 at 169.7 tok/s — 3.5x faster than the best local model. Best value cloud option at $0.28.
🏆 Highest Quality

Claude Opus 4.6 scores 8.65/10 — but Sonnet (8.59) nearly matches at 63% the cost. Diminishing returns at the top.
👑 Best Local

qwen3-coder at 48.8 tok/s with 7.48 quality. MoE architecture shines on M2 Max. 100% offline, zero cost, air-gapped.
Benchmark Pipeline
Four-stage pipeline: load prompts, execute against models, judge quality, generate reports.
📄 Prompt Bank

21 YAML prompts across 7 coding categories. Each defines task, expected behavior, and evaluation criteria.
21 prompts · 7 categories
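A prompt bank entry can be sketched as a small dataclass. The field names below are illustrative assumptions, not the project's actual YAML schema:

```python
from dataclasses import dataclass

# Hypothetical shape of one entry in code_generation.yaml; the field
# names are assumptions for illustration, not the real schema.
RAW = {
    "id": "codegen-01",
    "category": "code_generation",
    "task": "Write a function that merges two sorted lists.",
    "expected_behavior": "Returns one sorted list; handles empty inputs.",
    "evaluation_criteria": ["correctness", "completeness", "clarity"],
}

@dataclass
class Prompt:
    id: str
    category: str
    task: str
    expected_behavior: str
    evaluation_criteria: list

def load_prompt(raw: dict) -> Prompt:
    """Validate and wrap one parsed YAML entry."""
    return Prompt(**raw)

print(load_prompt(RAW).category)  # code_generation
```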

Runner

Executes prompts against Ollama (local) and Anthropic API (cloud). Captures TPS, TTFT, tokens, cost, and hardware metrics.
Ollama + Anthropic adapters
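The throughput math behind TPS and TTFT can be sketched from streaming timestamps. Whether the project measures decode-only or end-to-end throughput is an assumption here; this sketch uses the common decode-only convention:

```python
def streaming_metrics(t_start, t_first_token, t_done, completion_tokens):
    """Compute TTFT and TPS from wall-clock timestamps (seconds).

    TTFT: delay until the first streamed token arrives.
    TPS:  decode throughput after the first token (one common convention).
    """
    ttft = t_first_token - t_start
    tps = completion_tokens / (t_done - t_first_token)
    return ttft, tps

ttft, tps = streaming_metrics(0.0, 0.5, 10.5, 1697)
print(round(ttft, 1), round(tps, 1))  # 0.5 169.7
```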

Judge

Claude Sonnet 4.6 scores each output on correctness (40%), completeness (35%), and clarity (25%). Runs where Claude judges a Claude output carry a self-evaluation bias flag.
weighted scoring · bias flags
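The weighted scoring plus bias flag can be sketched in a few lines; the prefix check used for the flag is an assumption about how model names are matched:

```python
WEIGHTS = {"correctness": 0.40, "completeness": 0.35, "clarity": 0.25}

def judge_score(subscores, judge_model, candidate_model):
    """Weighted 0-10 score plus a self-evaluation bias flag."""
    score = sum(WEIGHTS[k] * subscores[k] for k in WEIGHTS)
    # Flag runs where a Claude judge scores a Claude output
    # (prefix matching is an assumption for this sketch).
    bias_flag = judge_model.startswith("claude") and candidate_model.startswith("claude")
    return round(score, 2), bias_flag

print(judge_score({"correctness": 9, "completeness": 8, "clarity": 9},
                  "claude-sonnet-4.6", "qwen3-coder"))  # (8.65, False)
```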
📊 Reports

Word docs, PowerPoint decks, React dashboard, and interactive comparison website. Full export pipeline.
docx + pptx + React + web
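The dashboard export step boils down to serializing result rows as JSON. The row schema here is illustrative, not the project's actual export format:

```python
import json

# Illustrative rows only; the real export schema is an assumption.
rows = [
    {"model": "qwen3-coder", "tps": 48.8, "quality": 7.48, "cost_usd": 0.0},
    {"model": "claude-haiku-4.5", "tps": 169.7, "quality": 8.25, "cost_usd": 0.28},
]

def export_dashboard_json(rows):
    """Serialize results in a stable order (fastest first) for the dashboard."""
    payload = {"results": sorted(rows, key=lambda r: -r["tps"])}
    return json.dumps(payload, indent=2)

print(export_dashboard_json(rows))
```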
7 Coding Task Types
Each category contains 3 prompts of varying complexity, testing different aspects of code intelligence.
$ cat code_generation.yaml

Code Generation

3 prompts · Write new code from spec
$ cat debugging_reasoning.yaml

Debugging & Reasoning

3 prompts · Find and fix bugs
$ cat refactoring.yaml

Refactoring

3 prompts · Improve existing code
$ cat explanation_teaching.yaml

Explanation & Teaching

3 prompts · Explain concepts clearly
$ cat short_quick.yaml

Short Quick Tasks

3 prompts · Fast utility tasks
$ cat long_complex.yaml

Long Complex Research

3 prompts · Deep architecture tasks
$ cat tool_calling.yaml

Tool Calling / Agentic

3 prompts · Agentic tool use
What 147 Tests Reveal
The data tells a nuanced story about local vs cloud tradeoffs.

Cloud wins on quality, but diminishing returns at the top

All 3 Claude models score 8.25+. Opus (8.65) barely edges Sonnet (8.59), making Sonnet the smart buy at 63% the cost. The quality gap between best local (7.48) and worst cloud (8.25) is meaningful at scale.

+0.77 quality gap
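One way to see the diminishing returns is quality points per dollar, an illustrative metric computed straight from the results table:

```python
# Quality score and cost per benchmark run, from the results table.
models = {"Claude Opus 4.6": (8.65, 1.48), "Claude Sonnet 4.6": (8.59, 0.93)}

for name, (quality, cost) in models.items():
    print(f"{name}: {quality / cost:.2f} quality points per dollar")
# Claude Opus 4.6: 5.84 quality points per dollar
# Claude Sonnet 4.6: 9.24 quality points per dollar
```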

MoE architecture dominates local inference

qwen3-coder (30B MoE) runs at 48.8 tok/s vs qwen2.5-coder (14B dense) at 15.6 tok/s: 3.1x faster despite being roughly twice the size. A MoE model activates only a fraction of its parameters per token, so each decode step reads less memory and makes better use of the M2 Max's unified memory bandwidth.

3.1x speed advantage
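The speed claim checks out directly from the table numbers:

```python
# Decode speeds (tok/s) from the results table.
tps = {"qwen3-coder (30B MoE)": 48.8, "qwen2.5-coder:14b (dense)": 15.6}

speedup = tps["qwen3-coder (30B MoE)"] / tps["qwen2.5-coder:14b (dense)"]
print(f"{speedup:.1f}x")  # 3.1x
```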

Claude-judging-Claude bias caveat

62 of 147 quality scores involve Claude models judging Claude outputs. These are flagged in all reports but may inflate cloud scores. A fair comparison would need an independent judge (GPT-4 or human reviewers).

62/147 flagged

Total cost: $3.95 for the entire benchmark

Cloud benchmark run: $2.69. Quality judging via Sonnet: $1.26. That's 147 tests across 7 models including full quality scoring. Local models contribute zero to the bill.

$3.95 all-in
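The cost arithmetic from the figures above:

```python
# Cloud run cost, judging cost, and test count from the writeup.
run_cost, judge_cost, tests = 2.69, 1.26, 147

total = run_cost + judge_cost
print(f"${total:.2f} total, ${total / tests:.3f} per test")
# $3.95 total, $0.027 per test
```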
Command-Line Interface
8 Typer commands for the full benchmark lifecycle. Run, judge, analyze, report, export.
llm-bench — terminal
$ python -m llm_bench run # Run benchmarks against all models
$ python -m llm_bench run --hardware-metrics # With GPU/thermal capture
$ python -m llm_bench run -m qwen3-coder -c code_generation # Filter by model and category
$ python -m llm_bench judge <run_id> # Score outputs with Claude Sonnet
$ python -m llm_bench analyze <run_id> # Generate analysis and rankings
$ python -m llm_bench report <run_id> # Generate Word + PowerPoint reports
$ python -m llm_bench export <run_id> # Export dashboard-ready JSON
$ python -m llm_bench hardware-report # Generate hardware infrastructure guide
$ python -m llm_bench cost-estimate # Estimate cost for a full run
$ python -m llm_bench list-runs # List all benchmark runs
Built With
Python-first toolkit with React dashboard and interactive comparison websites.
Python + Typer
CLI framework
Ollama
Local model runtime
Anthropic API
Cloud model access
React + Recharts
Dashboard UI
Chart.js
Comparison website charts
python-docx
Word report generation
python-pptx
PowerPoint generation
Rich
Terminal output

Dive Into the Data

Interactive Chart.js visualizations, hardware analysis, and the full comparison.
