TPS.SH
168 Tests Complete
12 Models · 168 Tests · 21 Prompts · $3.95 Total Cost · M2 Max Hardware
Benchmark Results
All 12 models compared across speed, quality, and cost. 9 local Ollama models vs 3 Claude API models on Apple M2 Max with 32GB unified memory.
Model               Type    TPS (tok/s)  TTFT     Avg Time  Quality (/10)  Cost
Claude Haiku 4.5    Cloud   167.8        0.7s     16.8s     8.35           $0.27
Claude Sonnet 4.6   Cloud   78.9         1.5s     39.9s     8.42           $0.94
Claude Opus 4.6     Cloud   74.7         1.7s     41.5s     8.61           $1.49
👑 qwen3-coder      Local   48.8         1.1s     37.8s     7.48           Free
gemma4:26b          Local   39.2         24.5s    65.9s     8.36           Free
phi4:14b            Local   17.9         1.5s     56.5s     7.21           Free
qwen2.5-coder:14b   Local   15.6         1.5s     68.2s     6.64           Free
deepseek-r1:14b     Local   14.6         70.2s    137.0s    5.89           Free
glm-4.7-flash       Local   10.2         54.8s    229.5s    5.30           Free
qwen2.5-coder:32b   Local   7.89         3.4s     135.7s    7.24           Free
🏆 gemma4:31b       Local   7.71         108.9s   308.6s    8.87           Free
qwen3:32b           Local   6.92         171.9s   497.1s    5.76           Free
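The TPS and TTFT columns are typically derived from streaming timestamps. A minimal sketch of the usual definitions (field names here are illustrative, not necessarily the benchmark's exact internals):

```python
# Sketch of how TPS and TTFT are commonly computed from a streamed
# response. TTFT measures time to the first token; TPS counts only the
# decode phase, so slow prompt processing does not deflate it.

def stream_metrics(t_start: float, t_first_token: float, t_end: float,
                   completion_tokens: int) -> dict:
    """Return time-to-first-token (s) and tokens-per-second for one request."""
    ttft = t_first_token - t_start
    decode_time = t_end - t_first_token
    tps = completion_tokens / decode_time if decode_time > 0 else 0.0
    return {"ttft_s": round(ttft, 1), "tps": round(tps, 1)}
```

For example, a 16.8s request that streams its first token at 0.7s and emits 2,700 completion tokens works out to roughly 168 tok/s.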

⚡ Fastest Model

Claude Haiku 4.5 at 167.8 tok/s — 3.4x faster than the best local model. Best value cloud option at $0.27.

🏆 Highest Quality

gemma4:31b scores 8.87/10 — the highest quality of any model, local or cloud. A 31B local model outscoring Claude Opus (8.61) is the Phase 3 headline.

👑 Best Local Speed

qwen3-coder at 48.8 tok/s with 7.48 quality. MoE architecture shines on M2 Max. 100% offline, zero cost, air-gapped.
Benchmark Pipeline
Four-stage pipeline: load prompts, execute against models, judge quality, generate reports.
1. Prompt Bank: 21 YAML prompts across 7 coding categories. Each defines task, expected behavior, and evaluation criteria. (21 prompts · 7 categories)

2. Runner: Executes prompts against Ollama (local) and the Anthropic API (cloud). Captures TPS, TTFT, tokens, cost, and hardware metrics. (Ollama + Anthropic adapters)

3. Judge: Claude Sonnet 4.6 scores each output on correctness (40%), completeness (35%), and clarity (25%). Bias-flagged for self-evaluation. (weighted scoring · bias flags)

4. Reports: Word docs, PowerPoint decks, React dashboard, and interactive comparison website. Full export pipeline. (docx + pptx + React + web)
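The judge stage's weighted rubric can be sketched in a few lines. A minimal illustration, assuming each criterion is scored 0-10 (the actual judging prompt and aggregation code may differ):

```python
# Weighted rubric described above: correctness 40%, completeness 35%,
# clarity 25%. Each criterion is assumed to be scored on a 0-10 scale,
# so the combined quality score is also 0-10.

WEIGHTS = {"correctness": 0.40, "completeness": 0.35, "clarity": 0.25}

def weighted_quality(scores: dict) -> float:
    """Collapse per-criterion judge scores into one 0-10 quality score."""
    return round(sum(scores[k] * w for k, w in WEIGHTS.items()), 2)
```

A model scoring 9 on correctness, 8 on completeness, and 9 on clarity lands at 8.65, comparable to the top quality scores in the table above.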
7 Coding Task Types
Each category contains 3 prompts of varying complexity, testing different aspects of code intelligence.
$ cat code_generation.yaml
Code Generation · 3 prompts · Write new code from spec

$ cat debugging_reasoning.yaml
Debugging & Reasoning · 3 prompts · Find and fix bugs

$ cat refactoring.yaml
Refactoring · 3 prompts · Improve existing code

$ cat explanation_teaching.yaml
Explanation & Teaching · 3 prompts · Explain concepts clearly

$ cat short_quick.yaml
Short Quick Tasks · 3 prompts · Fast utility tasks

$ cat long_complex.yaml
Long Complex Research · 3 prompts · Deep architecture tasks

$ cat tool_calling.yaml
Tool Calling / Agentic · 3 prompts · Agentic tool use
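A category file might look like the following sketch. The field names are illustrative assumptions, not the repo's actual schema; the source only states that each prompt defines task, expected behavior, and evaluation criteria:

```yaml
# Hypothetical prompt-bank entry -- field names are illustrative.
category: code_generation
prompts:
  - id: cg-01
    task: "Write a Python function that validates ISO-8601 dates"
    expected_behavior: "Accepts valid dates, raises ValueError on malformed input"
    evaluation_criteria:
      - correctness
      - completeness
      - clarity
```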
What 168 Tests Reveal
Phase 3 added 5 new local models and re-ran all 3 cloud models. The data tells a nuanced story about local vs cloud tradeoffs.

A local model takes the quality crown

gemma4:31b scores 8.87/10 — the highest quality of any model tested, surpassing Claude Opus (8.61) and Sonnet (8.42). A free, offline, 31B local model outperforming the best cloud models is the Phase 3 headline result.

8.87 > 8.61 quality

MoE architecture still dominates local speed

qwen3-coder (30B MoE) runs at 48.8 tok/s vs gemma4:26b at 39.2 tok/s and qwen2.5-coder:14b at 15.6 tok/s. MoE activates fewer parameters per token, leveraging M2 Max memory bandwidth more efficiently.

48.8 tok/s best local
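Why MoE wins on decode speed can be seen with a back-of-envelope roofline: if generation is memory-bandwidth bound, every active weight must be read once per token. The numbers below (400 GB/s for M2 Max, ~3B active parameters, 4-bit quantization) are illustrative assumptions, not measurements:

```python
# Roofline sketch for decode throughput, assuming generation is
# memory-bandwidth bound. All figures are illustrative assumptions.

def decode_tps_bound(bandwidth_gb_s: float, active_params_b: float,
                     bytes_per_param: float = 0.5) -> float:
    """Upper bound on tokens/sec: bandwidth / bytes of active weights.
    0.5 bytes/param approximates 4-bit quantization."""
    gb_per_token = active_params_b * bytes_per_param
    return bandwidth_gb_s / gb_per_token

dense_26b = decode_tps_bound(400, 26)  # dense: all 26B params read per token
moe_3b = decode_tps_bound(400, 3)      # MoE: only ~3B active params per token
```

Measured throughput falls well below these bounds (KV-cache traffic, compute, and runtime overhead all eat into it), but the ratio explains why a 30B MoE model can outrun a smaller dense one.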

gemma4:26b is the new quality-speed sweet spot

At 39.2 tok/s and 8.36 quality, gemma4:26b offers cloud-tier quality (above Haiku's 8.35) at local speeds with zero cost. It bridges the gap between qwen3-coder speed and Claude quality.

8.36 quality at 39.2 TPS

Claude-judging-Claude bias caveat

Cloud quality scores involve Claude models judging Claude outputs. These are flagged in all reports but may inflate cloud scores. A fair comparison would need an independent judge (GPT-4 or human reviewers).

bias flagged
Command-Line Interface
8 Typer commands covering the full benchmark lifecycle: run, judge, analyze, report, export, plus hardware and cost utilities.
llm-bench — terminal
$ python -m llm_bench run # Run benchmarks against all models
$ python -m llm_bench run --hardware-metrics # With GPU/thermal capture
$ python -m llm_bench run -m qwen3-coder -c code_generation # Filter by model and category
$ python -m llm_bench judge <run_id> # Score outputs with Claude Sonnet
$ python -m llm_bench analyze <run_id> # Generate analysis and rankings
$ python -m llm_bench report <run_id> # Generate Word + PowerPoint reports
$ python -m llm_bench export <run_id> # Export dashboard-ready JSON
$ python -m llm_bench hardware-report # Generate hardware infrastructure guide
$ python -m llm_bench cost-estimate # Estimate cost for a full run
$ python -m llm_bench list-runs # List all benchmark runs
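The arithmetic behind a command like `cost-estimate` is token counts times per-token prices. A minimal sketch, with placeholder prices (USD per million tokens, NOT actual Anthropic rates) and free local models:

```python
# Sketch of cost estimation: tokens x per-token price. The price table
# below is a placeholder assumption; local Ollama models cost $0.

PRICE_PER_MTOK = {                      # (input, output) USD per 1M tokens
    "claude-haiku-4-5": (1.00, 5.00),   # placeholder rates
    "claude-sonnet-4-6": (3.00, 15.00), # placeholder rates
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return estimated USD cost; models absent from the table are free."""
    in_price, out_price = PRICE_PER_MTOK.get(model, (0.0, 0.0))
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000
```

Summing this per request across a run is enough to produce the kind of whole-run total reported on this page.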
Built With
Python-first toolkit with React dashboard and interactive comparison websites.
Python + Typer: CLI framework
Ollama: Local model runtime
Anthropic API: Cloud model access
React + Recharts: Dashboard UI
Chart.js: Comparison website charts
python-docx: Word report generation
python-pptx: PowerPoint generation
Rich: Terminal output

Dive Into the Data

Interactive Chart.js visualizations, hardware analysis, and the full comparison.
