105 tests • 5 models • Quality scored

Local vs Cloud: Model Showdown

Five LLMs (two local Ollama models, three from the Claude API) tested across 21 coding prompts on an Apple M2 Max. Speed, quality, and cost compared.

Tests: 105
Models: 2 local + 3 cloud
Hardware: M2 Max • 32GB
API Cost: $2.69
Quality Judge: Claude Sonnet 4.6
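
For anyone reproducing the setup, here is a minimal sketch of how a single local run can be timed, assuming the `ollama` Python client and a local Ollama server with the model already pulled. TTFT is taken as the delay before the first streamed chunk; throughput uses Ollama's own `eval_count` token accounting. The model tag and prompt are illustrative.

```python
import time

import ollama  # pip install ollama; assumes a local Ollama server is running


def bench_local(model: str, prompt: str) -> dict:
    """Time one streamed completion: TTFT and output tokens/sec."""
    start = time.perf_counter()
    ttft = None
    last = None
    stream = ollama.chat(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        if ttft is None:
            ttft = time.perf_counter() - start  # first token arrived
        last = chunk
    total = time.perf_counter() - start
    tokens = last["eval_count"]  # final streamed chunk carries Ollama's token count
    return {"ttft_s": round(ttft, 2), "tok_per_s": round(tokens / total, 1)}


print(bench_local("qwen3-coder", "Reverse a linked list in Python."))
```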

The Five Contenders

Two free local models vs three paid Claude models. Is the cloud premium worth it?

⬤ Local Models (Free • Ollama)

👑 qwen3-coder
30B MoE • Free • Offline
Speed: 48.8 tok/s • Quality: 7.48 • TTFT: 1.1s

qwen2.5-coder:14b
14B Dense • Free • Offline
Speed: 15.6 tok/s • Quality: 6.64 • TTFT: 1.5s
⬤ Cloud Models (Paid • Anthropic API)

Claude Haiku 4.5
Cloud • $0.28
Speed: 169.7 tok/s • Quality: 8.25 • TTFT: 0.5s

Claude Sonnet 4.6
Cloud • $0.93
Speed: 77.7 tok/s • Quality: 8.59 • TTFT: 1.0s

🏆 Claude Opus 4.6
Cloud • $1.48
Speed: 76.6 tok/s • Quality: 8.65 • TTFT: 1.8s

Generation Speed

Tokens per second across all 21 prompts. Cloud models have a massive infrastructure advantage.

[Chart] Tokens per Second: average generation throughput (higher is better)
[Chart] Time to First Token: how quickly each model starts responding
[Chart] Average Response Time: total end-to-end seconds per prompt
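
Cloud timings come from the same two measurements taken on streamed responses. A comparable sketch, assuming the `anthropic` Python SDK; the model ID string is a placeholder, not necessarily the exact API name of the snapshot tested here:

```python
import time

from anthropic import Anthropic  # pip install anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def bench_cloud(model: str, prompt: str) -> dict:
    """Time one streamed Claude completion: TTFT and output tokens/sec."""
    start = time.perf_counter()
    ttft = None
    with client.messages.stream(
        model=model,
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    ) as stream:
        for _ in stream.text_stream:
            if ttft is None:
                ttft = time.perf_counter() - start  # first text chunk
        usage = stream.get_final_message().usage  # exact output token count
    total = time.perf_counter() - start
    return {"ttft_s": round(ttft, 2), "tok_per_s": round(usage.output_tokens / total, 1)}


print(bench_cloud("claude-haiku-4-5", "Reverse a linked list in Python."))
```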

Output Quality Scores

Every output scored by Claude Sonnet on correctness (40%), completeness (35%), and clarity (25%). Claude self-scores are flagged for bias.
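
For concreteness, the weighting arithmetic in miniature (the weights come from the rubric above; the sub-scores are hypothetical):

```python
# Rubric weights from above: correctness 40%, completeness 35%, clarity 25%.
WEIGHTS = {"correctness": 0.40, "completeness": 0.35, "clarity": 0.25}


def weighted_quality(scores: dict[str, float]) -> float:
    """Combine 0-10 sub-scores into the overall quality score."""
    return sum(scores[k] * w for k, w in WEIGHTS.items())


# e.g. a response judged 9 / 8 / 8 lands at 8.4 overall
print(weighted_quality({"correctness": 9, "completeness": 8, "clarity": 8}))  # 8.4
```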

[Chart] Quality vs Speed Trade-off: the holy grail is the top-right corner (fast and high quality); bubble size = total tokens output
[Chart] Quality Scores (Weighted): average quality score out of 10
[Chart] Cost per Quality Point: API cost divided by quality score (lower is better)
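
Worked example from the numbers above: Haiku's $0.28 at 8.25 quality comes to about $0.034 per quality point, Sonnet's $0.93 / 8.59 to about $0.108, and Opus's $1.48 / 8.65 to about $0.171. The local models are free, so this metric only differentiates the paid tiers.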

Performance by Task Type

Quality scores across 7 coding task categories. Where does each model shine?

[Chart] Quality Heatmap, score by category: color intensity = quality score (green = high, red = low)
[Chart] Category Quality Radar: quality scores across all 7 task types per model

Complete Comparison

Every metric, every model.

Model              Type   TPS    TTFT  Quality  Cost
qwen3-coder        Local  48.8   1.1s  7.48     Free
qwen2.5-coder:14b  Local  15.6   1.5s  6.64     Free
Claude Haiku 4.5   Cloud  169.7  0.5s  8.25     $0.28
Claude Sonnet 4.6  Cloud  77.7   1.0s  8.59     $0.93
Claude Opus 4.6    Cloud  76.6   1.8s  8.65     $1.48

Key Insights

What do 105 tests across 5 models reveal?

🏆 Cloud Wins on Quality

All 3 Claude models scored 8.25+ quality. Opus (8.65) edges Sonnet (8.59) by a hair. The best local model (qwen3-coder at 7.48) trails Haiku by 0.77 points — meaningful at scale.

Haiku is the Speed King

At 169.7 tok/s, Claude Haiku 4.5 is 3.5x faster than qwen3-coder and delivers near-Opus quality (8.25 vs 8.65) at 19% of the cost. The best value proposition in this benchmark by far.

💰 Local = Free but Slower

qwen3-coder delivers 7.48 quality at 0 cost — solid for offline/private use. But Haiku at $0.28 for 21 prompts gets you +0.77 quality and 3.5x speed. The gap is real.

🧠 Sonnet = Best Debugger

Claude Sonnet 4.6 scored 9.60 on Debugging & Reasoning — the highest single-category score of any model. For code review and bug-finding, it's unmatched at $0.93.

💻 qwen3-coder: Local Champion

The MoE architecture shines on Apple Silicon: 48.8 tok/s, 1.1s TTFT, and 7.48 quality, running 100% offline on your MacBook. Privacy-first, air-gapped, and completely free.

Bias Caveat

63 of 105 scores involve Claude judging Claude. These are flagged but may inflate cloud scores. A truly fair comparison would need an independent judge (e.g., GPT-4 or human reviewers).

What's Next

Your Local Models Are Fast. Can They Be Faster?

48.8 tok/s (local champion) vs 169.7 tok/s (cloud champion)

The 3.5x speed gap is real — but it's not about the model. It's about memory bandwidth. Better hardware can close that gap.
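
A rough back-of-envelope shows why bandwidth is the ceiling, with every input below an assumption rather than a measurement: during decoding, each new token has to stream the model's active weights through memory once.

```python
# Decode-speed ceiling: tok/s <= memory_bandwidth / bytes_read_per_token.
# All figures are illustrative assumptions, not measured values.
bandwidth_bytes_s = 400e9  # Apple M2 Max unified memory: ~400 GB/s
active_params = 3.3e9      # qwen3-coder 30B MoE, assuming ~3.3B active params/token
bytes_per_param = 0.5      # assuming ~4-bit quantization

ceiling = bandwidth_bytes_s / (active_params * bytes_per_param)
print(f"theoretical ceiling ≈ {ceiling:.0f} tok/s")  # ≈ 242 tok/s

# The measured 48.8 tok/s sits well below this bound (KV-cache traffic,
# attention compute, and runtime overhead all cost extra), but the ceiling
# scales linearly with bandwidth, which is why faster memory narrows the gap.
```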

Explore the Hardware Guide →
