147 TESTS COMPLETE

TPS.SH

7 Models · 147 Tests · 21 Prompts · $3.95 Total Cost · M2 Max Hardware
Benchmark Results
All 7 models compared across speed, quality, and cost. 4 local Ollama models vs 3 Claude API models on Apple M2 Max with 32GB unified memory.
| Model | Type | TPS (tok/s) | TTFT | Avg Time | Quality (/10) | Cost |
| --- | --- | --- | --- | --- | --- | --- |
| Claude Haiku 4.5 | Cloud | 169.7 | 0.5s | 16.6s | 8.25 | $0.28 |
| Claude Sonnet 4.6 | Cloud | 77.7 | 1.0s | 39.9s | 8.59 | $0.93 |
| 🏆 Claude Opus 4.6 | Cloud | 76.6 | 1.8s | 40.4s | 8.65 | $1.48 |
| 👑 qwen3-coder | Local | 48.8 | 1.1s | 37.8s | 7.48 | Free |
| qwen2.5-coder:14b | Local | 15.6 | 1.5s | 68.2s | 6.64 | Free |
| deepseek-r1:14b | Local | 14.6 | 70.2s | 137.0s | 5.89 | Free |
| glm-4.7-flash | Local | 10.2 | 54.8s | 229.5s | 5.30 | Free |

Fastest Model

Claude Haiku 4.5 at 169.7 tok/s — 3.5x faster than the best local model. Best value cloud option at $0.28.
🏆 Highest Quality

Claude Opus 4.6 scores 8.65/10 — but Sonnet (8.59) nearly matches at 63% the cost. Diminishing returns at the top.
👑 Best Local

qwen3-coder at 48.8 tok/s with 7.48 quality. MoE architecture shines on M2 Max. 100% offline, zero cost, air-gapped.
Benchmark Pipeline
Four-stage pipeline: load prompts, execute against models, judge quality, generate reports.
📄 Prompt Bank

21 YAML prompts across 7 coding categories. Each defines task, expected behavior, and evaluation criteria.
21 prompts · 7 categories
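A prompt bank entry can be sketched as a small dataclass. The field names below are illustrative assumptions, not the project's actual YAML schema:

```python
from dataclasses import dataclass

# Hypothetical shape of one entry in code_generation.yaml; the field
# names are assumptions for illustration, not the real schema.
RAW = {
    "id": "codegen-01",
    "category": "code_generation",
    "task": "Write a function that merges two sorted lists.",
    "expected_behavior": "Returns one sorted list; handles empty inputs.",
    "evaluation_criteria": ["correctness", "completeness", "clarity"],
}

@dataclass
class Prompt:
    id: str
    category: str
    task: str
    expected_behavior: str
    evaluation_criteria: list

def load_prompt(raw: dict) -> Prompt:
    """Validate and wrap one parsed YAML entry."""
    return Prompt(**raw)

print(load_prompt(RAW).category)  # code_generation
```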

Runner

Executes prompts against Ollama (local) and Anthropic API (cloud). Captures TPS, TTFT, tokens, cost, and hardware metrics.
Ollama + Anthropic adapters
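The throughput math behind TPS and TTFT can be sketched from streaming timestamps. Whether the project measures decode-only or end-to-end throughput is an assumption here; this sketch uses the common decode-only convention:

```python
def streaming_metrics(t_start, t_first_token, t_done, completion_tokens):
    """Compute TTFT and TPS from wall-clock timestamps (seconds).

    TTFT: delay until the first streamed token arrives.
    TPS:  decode throughput after the first token (one common convention).
    """
    ttft = t_first_token - t_start
    tps = completion_tokens / (t_done - t_first_token)
    return ttft, tps

ttft, tps = streaming_metrics(0.0, 0.5, 10.5, 1697)
print(round(ttft, 1), round(tps, 1))  # 0.5 169.7
```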

Judge

Claude Sonnet 4.6 scores each output on correctness (40%), completeness (35%), and clarity (25%). Runs where Claude judges a Claude output carry a self-evaluation bias flag.
weighted scoring · bias flags
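The weighted scoring plus bias flag can be sketched in a few lines; the prefix check used for the flag is an assumption about how model names are matched:

```python
WEIGHTS = {"correctness": 0.40, "completeness": 0.35, "clarity": 0.25}

def judge_score(subscores, judge_model, candidate_model):
    """Weighted 0-10 score plus a self-evaluation bias flag."""
    score = sum(WEIGHTS[k] * subscores[k] for k in WEIGHTS)
    # Flag runs where a Claude judge scores a Claude output
    # (prefix matching is an assumption for this sketch).
    bias_flag = judge_model.startswith("claude") and candidate_model.startswith("claude")
    return round(score, 2), bias_flag

print(judge_score({"correctness": 9, "completeness": 8, "clarity": 9},
                  "claude-sonnet-4.6", "qwen3-coder"))  # (8.65, False)
```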
📊 Reports

Word docs, PowerPoint decks, React dashboard, and interactive comparison website. Full export pipeline.
docx + pptx + React + web
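The dashboard export step boils down to serializing result rows as JSON. The row schema here is illustrative, not the project's actual export format:

```python
import json

# Illustrative rows only; the real export schema is an assumption.
rows = [
    {"model": "qwen3-coder", "tps": 48.8, "quality": 7.48, "cost_usd": 0.0},
    {"model": "claude-haiku-4.5", "tps": 169.7, "quality": 8.25, "cost_usd": 0.28},
]

def export_dashboard_json(rows):
    """Serialize results in a stable order (fastest first) for the dashboard."""
    payload = {"results": sorted(rows, key=lambda r: -r["tps"])}
    return json.dumps(payload, indent=2)

print(export_dashboard_json(rows))
```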
7 Coding Task Types
Each category contains 3 prompts of varying complexity, testing different aspects of code intelligence.
$ cat code_generation.yaml

Code Generation

3 prompts · Write new code from spec
$ cat debugging_reasoning.yaml

Debugging & Reasoning

3 prompts · Find and fix bugs
$ cat refactoring.yaml

Refactoring

3 prompts · Improve existing code
$ cat explanation_teaching.yaml

Explanation & Teaching

3 prompts · Explain concepts clearly
$ cat short_quick.yaml

Short Quick Tasks

3 prompts · Fast utility tasks
$ cat long_complex.yaml

Long Complex Research

3 prompts · Deep architecture tasks
$ cat tool_calling.yaml

Tool Calling / Agentic

3 prompts · Agentic tool use
What 147 Tests Reveal
The data tells a nuanced story about local vs cloud tradeoffs.

Cloud wins on quality, but diminishing returns at the top

All 3 Claude models score 8.25+. Opus (8.65) barely edges Sonnet (8.59), making Sonnet the smart buy at 63% the cost. The quality gap between best local (7.48) and worst cloud (8.25) is meaningful at scale.

+0.77 quality gap
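One way to see the diminishing returns is quality points per dollar, an illustrative metric computed straight from the results table:

```python
# Quality score and cost per benchmark run, from the results table.
models = {"Claude Opus 4.6": (8.65, 1.48), "Claude Sonnet 4.6": (8.59, 0.93)}

for name, (quality, cost) in models.items():
    print(f"{name}: {quality / cost:.2f} quality points per dollar")
# Claude Opus 4.6: 5.84 quality points per dollar
# Claude Sonnet 4.6: 9.24 quality points per dollar
```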

MoE architecture dominates local inference

qwen3-coder (30B MoE) runs at 48.8 tok/s vs qwen2.5-coder (14B dense) at 15.6 tok/s: 3.1x faster despite being roughly twice the size. A MoE model activates only a fraction of its parameters per token, so each decode step reads less memory and makes better use of the M2 Max's unified memory bandwidth.

3.1x speed advantage
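The speed claim checks out directly from the table numbers:

```python
# Decode speeds (tok/s) from the results table.
tps = {"qwen3-coder (30B MoE)": 48.8, "qwen2.5-coder:14b (dense)": 15.6}

speedup = tps["qwen3-coder (30B MoE)"] / tps["qwen2.5-coder:14b (dense)"]
print(f"{speedup:.1f}x")  # 3.1x
```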

Claude-judging-Claude bias caveat

62 of 147 quality scores involve Claude models judging Claude outputs. These are flagged in all reports but may inflate cloud scores. A fair comparison would need an independent judge (GPT-4 or human reviewers).

62/147 flagged

Total cost: $3.95 for the entire benchmark

Cloud benchmark run: $2.69. Quality judging via Sonnet: $1.26. That's 147 tests across 7 models including full quality scoring. Local models contribute zero to the bill.

$3.95 all-in
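The cost arithmetic from the figures above:

```python
# Cloud run cost, judging cost, and test count from the writeup.
run_cost, judge_cost, tests = 2.69, 1.26, 147

total = run_cost + judge_cost
print(f"${total:.2f} total, ${total / tests:.3f} per test")
# $3.95 total, $0.027 per test
```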
Command-Line Interface
8 Typer commands for the full benchmark lifecycle. Run, judge, analyze, report, export.
llm-bench — terminal
$ python -m llm_bench run # Run benchmarks against all models
$ python -m llm_bench run --hardware-metrics # With GPU/thermal capture
$ python -m llm_bench run -m qwen3-coder -c code_generation # Filter by model and category
$ python -m llm_bench judge <run_id> # Score outputs with Claude Sonnet
$ python -m llm_bench analyze <run_id> # Generate analysis and rankings
$ python -m llm_bench report <run_id> # Generate Word + PowerPoint reports
$ python -m llm_bench export <run_id> # Export dashboard-ready JSON
$ python -m llm_bench hardware-report # Generate hardware infrastructure guide
$ python -m llm_bench cost-estimate # Estimate cost for a full run
$ python -m llm_bench list-runs # List all benchmark runs
Built With
Python-first toolkit with React dashboard and interactive comparison websites.
Python + Typer
CLI framework
Ollama
Local model runtime
Anthropic API
Cloud model access
React + Recharts
Dashboard UI
Chart.js
Comparison website charts
python-docx
Word report generation
python-pptx
PowerPoint generation
Rich
Terminal output

Dive Into the Data

Interactive Chart.js visualizations, hardware analysis, and the full comparison.
