147 TESTS COMPLETE
TPS.SH
01 / RESULTS
Benchmark Results
All 7 models compared across speed, quality, and cost: 4 local Ollama models vs 3 Claude API models, run on an Apple M2 Max with 32GB unified memory.
|     | Model             | Type  | TPS (tok/s) | TTFT  | Avg Time | Quality (/10) | Cost  |
|-----|-------------------|-------|-------------|-------|----------|---------------|-------|
| ⚡  | Claude Haiku 4.5  | Cloud | 169.7       | 0.5s  | 16.6s    | 8.25          | $0.28 |
|     | Claude Sonnet 4.6 | Cloud | 77.7        | 1.0s  | 39.9s    | 8.59          | $0.93 |
| 🏆 | Claude Opus 4.6   | Cloud | 76.6        | 1.8s  | 40.4s    | 8.65          | $1.48 |
| 👑 | qwen3-coder       | Local | 48.8        | 1.1s  | 37.8s    | 7.48          | Free  |
|     | qwen2.5-coder:14b | Local | 15.6        | 1.5s  | 68.2s    | 6.64          | Free  |
|     | deepseek-r1:14b   | Local | 14.6        | 70.2s | 137.0s   | 5.89          | Free  |
|     | glm-4.7-flash     | Local | 10.2        | 54.8s | 229.5s   | 5.30          | Free  |
⚡
Fastest Model
Claude Haiku 4.5 at 169.7 tok/s — 3.5x faster than the best local model. Best value cloud option at $0.28.
🏆
Highest Quality
Claude Opus 4.6 scores 8.65/10 — but Sonnet (8.59) nearly matches at 63% the cost. Diminishing returns at the top.
👑
Best Local
qwen3-coder at 48.8 tok/s with 7.48 quality. MoE architecture shines on M2 Max. 100% offline, zero cost, air-gapped.
02 / ARCHITECTURE
Benchmark Pipeline
Four-stage pipeline: load prompts, execute against models, judge quality, generate reports.
📄
Prompt Bank
21 YAML prompts across 7 coding categories. Each defines task, expected behavior, and evaluation criteria.
21 prompts · 7 categories
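As a sketch, loading and validating one of these prompt files might look like the following; the field names here are illustrative, not the project's actual schema.

```python
import yaml  # PyYAML

# Illustrative prompt shape; the real llm_bench schema may differ.
PROMPT_YAML = """
id: codegen_01
category: code_generation
task: Write a Python function that parses ISO-8601 timestamps.
expected_behavior: Returns a timezone-aware datetime; raises ValueError on bad input.
evaluation_criteria: [correctness, completeness, clarity]
"""

prompt = yaml.safe_load(PROMPT_YAML)

# Every prompt must define its task and how it will be graded.
for field in ("id", "category", "task", "expected_behavior", "evaluation_criteria"):
    assert field in prompt, f"missing field: {field}"

print(f"[{prompt['category']}] {prompt['task']}")
```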
⚙
Runner
Executes prompts against Ollama (local) and Anthropic API (cloud). Captures TPS, TTFT, tokens, cost, and hardware metrics.
Ollama + Anthropic adapters
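A minimal sketch of how a single local measurement could work, streaming from Ollama's `/api/generate` endpoint: TTFT is wall-clock time to the first streamed chunk, and TPS comes from the token counts Ollama reports in its final message. The real runner's adapter layer is the project's own.

```python
import json
import time

import requests

def run_once(model: str, prompt: str) -> dict:
    start = time.perf_counter()
    ttft = None
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": True},
        stream=True,
        timeout=600,
    )
    for line in resp.iter_lines():
        if not line:
            continue
        if ttft is None:
            ttft = time.perf_counter() - start  # first chunk arrived
        chunk = json.loads(line)
        if chunk.get("done"):
            # eval_count = generated tokens, eval_duration = nanoseconds spent generating
            tps = chunk["eval_count"] / (chunk["eval_duration"] / 1e9)
            return {"ttft_s": round(ttft, 2), "tps": round(tps, 1),
                    "total_s": round(time.perf_counter() - start, 1)}
    raise RuntimeError("stream ended without a final 'done' message")

print(run_once("qwen3-coder", "Write a binary search in Python."))
```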
⚖
Judge
Claude Sonnet 4.6 scores each output on correctness (40%), completeness (35%), and clarity (25%). Runs where Claude judges Claude output are flagged for self-evaluation bias.
weighted scoring · bias flags
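The weighted rubric reduces to a few lines. The weights are the ones stated above; the 0-10 sub-score scale and the model ID strings are assumptions.

```python
# Judge rubric: correctness 40%, completeness 35%, clarity 25%.
WEIGHTS = {"correctness": 0.40, "completeness": 0.35, "clarity": 0.25}

def weighted_score(subscores: dict) -> float:
    return round(sum(WEIGHTS[k] * subscores[k] for k in WEIGHTS), 2)

def bias_flagged(judge_model: str, tested_model: str) -> bool:
    # Claude judging Claude output may inflate scores; flag it.
    return judge_model.startswith("claude") and tested_model.startswith("claude")

print(weighted_score({"correctness": 9, "completeness": 8, "clarity": 7}))  # 8.15
print(bias_flagged("claude-sonnet-4.6", "claude-opus-4.6"))                 # True
```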
📊
Reports
Word docs, PowerPoint decks, React dashboard, and interactive comparison website. Full export pipeline.
docx + pptx + React + web
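For the Word export, a sketch of the python-docx mechanics; the sample rows come from the results table above, but the actual report layout is the project's own.

```python
from docx import Document

results = [
    ("Claude Haiku 4.5", 169.7, 8.25),
    ("qwen3-coder", 48.8, 7.48),
]

doc = Document()
doc.add_heading("Benchmark Results", level=1)

# One header row, then one row per model result.
table = doc.add_table(rows=1, cols=3)
hdr = table.rows[0].cells
hdr[0].text, hdr[1].text, hdr[2].text = "Model", "TPS", "Quality"

for model, tps, quality in results:
    row = table.add_row().cells
    row[0].text, row[1].text, row[2].text = model, str(tps), str(quality)

doc.save("benchmark_report.docx")
```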
03 / PROMPT CATEGORIES
7 Coding Task Types
Each category contains 3 prompts of varying complexity, testing different aspects of code intelligence.
$ cat code_generation.yaml
Code Generation
3 prompts · Write new code from spec
$ cat debugging_reasoning.yaml
Debugging & Reasoning
3 prompts · Find and fix bugs
$ cat refactoring.yaml
Refactoring
3 prompts · Improve existing code
$ cat explanation_teaching.yaml
Explanation & Teaching
3 prompts · Explain concepts clearly
$ cat short_quick.yaml
Short Quick Tasks
3 prompts · Fast utility tasks
$ cat long_complex.yaml
Long Complex Research
3 prompts · Deep architecture tasks
$ cat tool_calling.yaml
Tool Calling / Agentic
3 prompts · Agentic tool use
04 / KEY FINDINGS
What 147 Tests Reveal
The data tells a nuanced story about local vs cloud tradeoffs.
Cloud wins on quality, but diminishing returns at the top
All 3 Claude models score 8.25+. Opus (8.65) barely edges Sonnet (8.59), making Sonnet the smart buy at 63% the cost. The quality gap between best local (7.48) and worst cloud (8.25) is meaningful at scale.
+0.77 quality gap
MoE architecture dominates local inference
qwen3-coder (30B MoE) runs at 48.8 tok/s vs qwen2.5-coder (14B dense) at 15.6 tok/s — 3.1x faster despite being 2x larger. MoE activates fewer parameters per token, leveraging M2 Max memory bandwidth more efficiently.
3.1x speed advantage
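A back-of-envelope model of why fewer activated parameters helps, treating the activated-parameter count, quantization width, and bandwidth figure as assumptions (30B-class MoE coder models typically activate roughly 3B parameters per token; M2 Max memory bandwidth is around 400GB/s):

```python
# Rough decode-speed ceiling: each generated token must stream the active
# weights through memory once. All inputs below are assumptions.
BANDWIDTH_GBS = 400      # M2 Max unified memory bandwidth (approx.)
BYTES_PER_PARAM = 0.5    # ~4-bit quantization

def ceiling_tps(active_params_b: float) -> float:
    return BANDWIDTH_GBS / (active_params_b * BYTES_PER_PARAM)

print(f"MoE, ~3B active: {ceiling_tps(3):.0f} tok/s ceiling")   # 267
print(f"dense 14B:       {ceiling_tps(14):.0f} tok/s ceiling")  # 57
# Measured 48.8 vs 15.6 tok/s sits below both ceilings but tracks the same ordering.
```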
Claude-judging-Claude bias caveat
62 of 147 quality scores involve Claude models judging Claude outputs. These are flagged in all reports but may inflate cloud scores. A fair comparison would need an independent judge (GPT-4 or human reviewers).
62/147 flagged
Total cost: $3.95 for the entire benchmark
Cloud benchmark run: $2.69. Quality judging via Sonnet: $1.26. That's 147 tests across 7 models including full quality scoring. Local models contribute zero to the bill.
$3.95 all-in
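A sketch of the cost accounting, with per-million-token rates left as inputs rather than claims about current Anthropic pricing; the example rates and token counts below are placeholders.

```python
def api_cost(input_tokens: int, output_tokens: int,
             in_rate: float, out_rate: float) -> float:
    """Rates are USD per million tokens."""
    return input_tokens / 1e6 * in_rate + output_tokens / 1e6 * out_rate

# Placeholder rates and counts, purely illustrative:
print(f"example call: ${api_cost(50_000, 12_000, in_rate=3.0, out_rate=15.0):.2f}")

run_cost = 2.69    # 63 cloud executions (3 models x 21 prompts)
judge_cost = 1.26  # Sonnet scoring all 147 outputs
total = run_cost + judge_cost
print(f"all-in: ${total:.2f}, per test: ${total / 147:.3f}")
# all-in: $3.95, per test: $0.027
```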
05 / CLI
Command-Line Interface
8 Typer commands covering the full benchmark lifecycle: run, judge, analyze, report, and export, plus hardware, cost, and run-listing utilities.
$ python -m llm_bench run
$ python -m llm_bench run --hardware-metrics
$ python -m llm_bench run -m qwen3-coder -c code_generation
$ python -m llm_bench judge <run_id>
$ python -m llm_bench analyze <run_id>
$ python -m llm_bench report <run_id>
$ python -m llm_bench export <run_id>
$ python -m llm_bench hardware-report
$ python -m llm_bench cost-estimate
$ python -m llm_bench list-runs
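For reference, a minimal sketch of how two of these commands could be registered with Typer. The command and option names mirror the invocations above; the bodies and help text are illustrative, and `python -m llm_bench` would typically reach `app()` through a `__main__.py`.

```python
import typer

app = typer.Typer(help="LLM benchmark toolkit")

@app.command()
def run(
    model: str = typer.Option(None, "--model", "-m", help="Limit to one model"),
    category: str = typer.Option(None, "--category", "-c", help="Limit to one prompt category"),
    hardware_metrics: bool = typer.Option(False, "--hardware-metrics", help="Capture hardware metrics"),
):
    """Execute the prompt bank against the configured models."""
    typer.echo(f"running: model={model} category={category} hw={hardware_metrics}")

@app.command()
def judge(run_id: str):
    """Score a completed run's outputs with the judge model."""
    typer.echo(f"judging {run_id}")

if __name__ == "__main__":
    app()
```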
06 / TECH STACK
Built With
Python-first toolkit with a React dashboard and an interactive comparison website.
Python + Typer
CLI framework
Ollama
Local model runtime
Anthropic API
Cloud model access
React + Recharts
Dashboard UI
Chart.js
Comparison website charts
python-docx
Word report generation
python-pptx
PowerPoint generation
Dive Into the Data
Interactive Chart.js visualizations, hardware analysis, and the full comparison.