Five LLMs (two local via Ollama, three via the Claude API) tested across 21 coding prompts on an Apple M2 Max. Speed, quality, and cost compared.
Two free local models vs three paid Claude models. Is the cloud premium worth it?
Tokens per second across all 21 prompts. Cloud models have a massive infrastructure advantage.
Every output scored by Claude Sonnet on correctness (40%), completeness (35%), and clarity (25%). Claude self-scores are flagged for bias.
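The weighted rubric above reduces to a simple formula. A minimal sketch (function and argument names are illustrative, not from the benchmark code):

```python
def quality_score(correctness: float, completeness: float, clarity: float) -> float:
    """Weighted quality on a 0-10 scale: correctness 40%, completeness 35%, clarity 25%."""
    return 0.40 * correctness + 0.35 * completeness + 0.25 * clarity

# Example: a response judged 9/10 correct, 8/10 complete, 8/10 clear
score = quality_score(9, 8, 8)  # 0.40*9 + 0.35*8 + 0.25*8 = 8.4
```

Because correctness carries the largest weight, a subtly wrong answer is penalized more than a terse but accurate one.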
Quality scores across 7 coding task categories. Where does each model shine?
Every metric, every model.
| Model | Type | TPS (tok/s) | TTFT (s) | Avg Time (s) | Quality (/10) | Tokens | Cost ($) |
|---|---|---|---|---|---|---|---|
What do 105 tests across 5 models reveal?
All 3 Claude models scored 8.25+ quality. Opus (8.65) edges Sonnet (8.59) by a hair. The best local model (qwen3-coder at 7.48) trails Haiku by 0.77 points — meaningful at scale.
At 169.7 tok/s, Claude Haiku 4.5 is 3.5x faster than qwen3-coder and delivers near-Opus quality (8.25 vs 8.65) at 19% of the cost. The best value proposition in this benchmark by far.
qwen3-coder delivers 7.48 quality at zero cost, solid for offline or private use. But Haiku's $0.28 for all 21 prompts buys +0.77 quality and 3.5x the speed. The gap is real.
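One way to make that gap concrete is cost per extra quality point. The figures come from this benchmark; the comparison logic itself is a sketch:

```python
# Benchmark figures: (quality /10, cost in USD for 21 prompts, tok/s)
models = {
    "qwen3-coder":      (7.48, 0.00, 48.8),
    "claude-haiku-4.5": (8.25, 0.28, 169.7),
}

q_local, c_local, s_local = models["qwen3-coder"]
q_haiku, c_haiku, s_haiku = models["claude-haiku-4.5"]

quality_gain = q_haiku - q_local         # +0.77 quality points
cost_per_point = c_haiku / quality_gain  # ~$0.36 per extra quality point
speedup = s_haiku / s_local              # ~3.5x faster decoding
```

At roughly 36 cents per quality point over 21 prompts, the cloud premium is small in absolute terms; the real trade-off is privacy and offline operation, not money.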
Claude Sonnet 4.6 scored 9.60 on Debugging & Reasoning — the highest single-category score of any model. For code review and bug-finding, it's unmatched at $0.93.
qwen3-coder's MoE architecture shines on Apple Silicon: 48.8 tok/s, 1.1 s TTFT, and 7.48 quality. It runs fully offline on a MacBook with no internet connection. Privacy-first, air-gapped, and completely free.
63 of 105 scores involve Claude judging Claude. These are flagged but may inflate cloud scores. A truly fair comparison would need an independent judge (e.g., GPT-4 or human reviewers).
The 3.5x speed gap is real, but it comes from infrastructure, not model quality. Local decode speed is bounded by memory bandwidth, so better hardware can close that gap.
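A rough back-of-envelope shows why bandwidth, not compute, caps local decode speed: each generated token must stream the model's active weights from memory, so tok/s is bounded by bandwidth divided by active-weight bytes. All numbers below are illustrative assumptions, not measured values:

```python
def decode_tok_s_ceiling(bandwidth_gb_s: float, active_params_b: float,
                         bytes_per_param: float) -> float:
    """Theoretical upper bound on decode tok/s: each token reads every active weight once."""
    active_weight_gb = active_params_b * bytes_per_param
    return bandwidth_gb_s / active_weight_gb

# Illustrative: ~400 GB/s unified memory, a MoE with ~3B active params
# quantized to 4 bits (0.5 bytes/param)
ceiling = decode_tok_s_ceiling(400, 3.0, 0.5)  # ~267 tok/s theoretical ceiling
```

Measured throughput always lands well below this ceiling (KV-cache reads, kernel overhead, and imperfect bandwidth utilization all eat into it), but the model explains the trend: double the memory bandwidth and the local/cloud speed gap roughly halves.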
Explore the Hardware Guide →