Five LLMs (two local via Ollama, three via the Claude API) tested across 21 coding prompts on an Apple M2 Max. Speed, quality, and cost compared.
Two free local models vs three paid Claude models. Is the cloud premium worth it?
Tokens per second across all 21 prompts. Cloud models have a massive infrastructure advantage.
Every output scored by Claude Sonnet on correctness (40%), completeness (35%), and clarity (25%). Claude self-scores are flagged for bias.
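The weighted rubric above reduces to a simple formula. A minimal sketch (function and argument names are illustrative, not from the benchmark code):

```python
def quality_score(correctness: float, completeness: float, clarity: float) -> float:
    """Weighted quality on a 0-10 scale: correctness 40%, completeness 35%, clarity 25%."""
    return 0.40 * correctness + 0.35 * completeness + 0.25 * clarity

# Example: a response judged 9/10 correct, 8/10 complete, 8/10 clear
score = quality_score(9, 8, 8)  # 0.40*9 + 0.35*8 + 0.25*8 = 8.4
```

Because correctness carries the largest weight, a subtly wrong answer is penalized more than a terse but accurate one.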
Quality scores across 7 coding task categories. Where does each model shine?
Every metric, every model.
| Model | Type | TPS (tok/s) | TTFT (s) | Avg Time (s) | Quality (/10) | Tokens | Cost ($) |
|---|---|---|---|---|---|---|---|
What do 105 tests across 5 models reveal?
All 3 Claude models scored 8.25+ quality. Opus (8.65) edges Sonnet (8.59) by a hair. The best local model (qwen3-coder at 7.48) trails Haiku by 0.77 points — meaningful at scale.
At 169.7 tok/s, Claude Haiku 4.5 is 3.5x faster than qwen3-coder and delivers near-Opus quality (8.25 vs 8.65) at 19% of the cost. The best value proposition in this benchmark by far.
qwen3-coder delivers 7.48 quality at zero cost, solid for offline or private use. But Haiku's $0.28 for all 21 prompts buys +0.77 quality and 3.5x the speed. The gap is real.
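One way to make that gap concrete is cost per extra quality point. The figures come from this benchmark; the comparison logic itself is a sketch:

```python
# Benchmark figures: (quality /10, cost in USD for 21 prompts, tok/s)
models = {
    "qwen3-coder":      (7.48, 0.00, 48.8),
    "claude-haiku-4.5": (8.25, 0.28, 169.7),
}

q_local, c_local, s_local = models["qwen3-coder"]
q_haiku, c_haiku, s_haiku = models["claude-haiku-4.5"]

quality_gain = q_haiku - q_local         # +0.77 quality points
cost_per_point = c_haiku / quality_gain  # ~$0.36 per extra quality point
speedup = s_haiku / s_local              # ~3.5x faster decoding
```

At roughly 36 cents per quality point over 21 prompts, the cloud premium is small in absolute terms; the real trade-off is privacy and offline operation, not money.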
Claude Sonnet 4.6 scored 9.60 on Debugging & Reasoning — the highest single-category score of any model. For code review and bug-finding, it's unmatched at $0.93.
qwen3-coder's MoE architecture shines on Apple Silicon: 48.8 tok/s, 1.1 s TTFT, and 7.48 quality. It runs fully offline on a MacBook with no internet connection. Privacy-first, air-gapped, and completely free.
63 of 105 scores involve Claude judging Claude. These are flagged but may inflate cloud scores. A truly fair comparison would need an independent judge (e.g., GPT-4 or human reviewers).
The 3.5x speed gap is real, but it comes from infrastructure, not model quality. Local decode speed is bounded by memory bandwidth, so better hardware can close that gap.
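A rough back-of-envelope shows why bandwidth, not compute, caps local decode speed: each generated token must stream the model's active weights from memory, so tok/s is bounded by bandwidth divided by active-weight bytes. All numbers below are illustrative assumptions, not measured values:

```python
def decode_tok_s_ceiling(bandwidth_gb_s: float, active_params_b: float,
                         bytes_per_param: float) -> float:
    """Theoretical upper bound on decode tok/s: each token reads every active weight once."""
    active_weight_gb = active_params_b * bytes_per_param
    return bandwidth_gb_s / active_weight_gb

# Illustrative: ~400 GB/s unified memory, a MoE with ~3B active params
# quantized to 4 bits (0.5 bytes/param)
ceiling = decode_tok_s_ceiling(400, 3.0, 0.5)  # ~267 tok/s theoretical ceiling
```

Measured throughput always lands well below this ceiling (KV-cache reads, kernel overhead, and imperfect bandwidth utilization all eat into it), but the model explains the trend: double the memory bandwidth and the local/cloud speed gap roughly halves.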
Explore the Hardware Guide →