
On-Premise LLM
Infrastructure Guide

Memory Bandwidth is the Bottleneck

Understanding what drives local LLM performance, and choosing the right hardware for your workload.

- Benchmarks: 147 real tests
- Hardware: M2 Max 32GB
- Models: 7 tested
- Framework: Ollama + llama.cpp
From Benchmarks to Hardware

Phase 1 revealed the gap. Phase 2 explains why — and how to close it.

In Phase 1, we benchmarked 5 models across 21 coding prompts. The best local model (qwen3-coder at 48.8 tok/s) delivered solid quality for free — but cloud models ran 3.5x faster. Why? And can better hardware close that gap?

- Best local TPS: 48.8
- Best cloud TPS: 169.7
- Speed gap: 3.5x
- M2 Max bandwidth: 400 GB/s
The answer is memory bandwidth. Scroll down to see the formula, the math, and the hardware that can push local TPS past 90.

Why Memory Bandwidth Determines TPS

Local LLM inference is memory-bound, not compute-bound. Every token generated requires reading the entire model from memory once. The speed of that read is your throughput ceiling.

The Fundamental Formula

TPS = (Memory Bandwidth [GB/s] / Model Size [GB]) × Efficiency
Efficiency factor: 0.5-0.7 is typical for Ollama/llama.cpp on Apple Silicon.
Memory Bandwidth = Pipe Diameter
400 GB/s on M2 Max
How fast data can flow from unified memory to the GPU/CPU. Wider pipe = more data per second = more tokens.
Model Size = Water Volume Per Token
15.3 GB for qwen3-coder Q4
Each token requires reading the full model weights. Bigger model = more data per token = slower generation.
VALIDATION AGAINST REAL DATA
Dense prediction: 400 / 15.3 × 0.6 = 15.7 TPS. Actual: 48.8 TPS.
The prediction undershoots because qwen3-coder uses an MoE architecture: only 8B of its 30B parameters are active per token, so the effective read per token is ~5.2 GB, not 15.3 GB. Rerunning the formula with that size gives 400 / 5.2 × 0.6 = 46.2 TPS, very close to the measured 48.8 TPS.
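The formula is simple enough to check in a few lines. A minimal sketch in Python (the function name and rounding are ours; the bandwidth, model sizes, and efficiency come from the numbers above):

```python
def projected_tps(bandwidth_gbs: float, model_size_gb: float,
                  efficiency: float = 0.6) -> float:
    """Project tokens/second: each token requires one full read of the
    model weights, so throughput = bandwidth / read size, discounted by
    a framework efficiency factor (0.5-0.7 for Ollama/llama.cpp)."""
    return bandwidth_gbs / model_size_gb * efficiency

# Dense assumption: the full 15.3 GB is read per token.
print(round(projected_tps(400, 15.3), 1))  # 15.7
# MoE: only the ~5.2 GB of active experts is read per token.
print(round(projected_tps(400, 5.2), 1))   # 46.2
```

The MoE projection lands within 5% of the measured 48.8 TPS, which is about as close as a one-line model gets.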

Interactive TPS Projector

Adjust bandwidth, model size, and efficiency to project tokens-per-second for any hardware + model combination.

Default inputs:
- Memory bandwidth: 400 GB/s
- Model size: 15.0 GB
- Efficiency factor: 0.60

M2 Max benchmark reference:
- qwen3-coder (MoE, ~5.2 GB active): 48.8 TPS actual
- qwen2.5-coder:14b (7.4 GB Q4): 15.6 TPS actual
- deepseek-r1:14b (7.4 GB Q4): 14.6 TPS actual
- glm-4.7-flash (~5 GB Q4): 10.2 TPS actual

Projected TPS: 400 GB/s / 15.0 GB × 0.60 = 16.0 tokens/second

Apple Silicon Lineup

Six hardware configurations compared. Projected TPS based on the bandwidth formula with qwen3-coder (15.3 GB Q4, MoE ~5.2 GB active).

TPS Across Hardware

Projected tokens-per-second for qwen3-coder on each hardware configuration. The M2 Max bar shows actual benchmark data.

qwen3-coder Projected TPS by Hardware

Horizontal bars. Green = actual measured data. Blue = projected from bandwidth formula.

Can Your Hardware Run It?

Select a model and quantization to see which hardware configs can fit it in RAM, and the projected throughput.

RAM requirement includes ~2-4 GB overhead for KV cache and runtime. Models need to fit entirely in RAM for acceptable performance (no swap).
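The fit check reduces to a single comparison. A hedged sketch (the 3 GB default overhead is a midpoint of the 2-4 GB range stated above; the function name and example sizes are ours):

```python
def fits_in_ram(model_size_gb: float, ram_gb: float,
                overhead_gb: float = 3.0) -> bool:
    """Weights plus KV-cache/runtime overhead must fit entirely in
    unified memory; swapping makes generation unusably slow."""
    return model_size_gb + overhead_gb <= ram_gb

print(fits_in_ram(15.3, 32.0))  # True:  qwen3-coder Q4 on a 32 GB M2 Max
print(fits_in_ram(40.0, 32.0))  # False: ~40 GB of Q4 weights needs a bigger machine
```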

exo Cluster Estimator

exo lets you pool multiple Macs into a single inference cluster. Add Mac Mini nodes to see aggregate performance projections.

Default: 2 Mac Mini M4 Pro nodes.
Scaling efficiency: 95% for 2 nodes, 85% for 3-4, 75% for 5+ (network overhead). Based on exo/Thunderbolt clustering benchmarks.
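Those efficiency tiers fold into a small estimator. A sketch under the stated assumptions (the per-node TPS is an input you supply; `cluster_tps` is our name, not part of exo's API):

```python
def cluster_tps(single_node_tps: float, nodes: int) -> float:
    """Aggregate throughput for an exo cluster, applying the scaling
    efficiencies quoted above: 95% for 2 nodes, 85% for 3-4, 75% for 5+."""
    if nodes <= 1:
        eff = 1.00
    elif nodes == 2:
        eff = 0.95
    elif nodes <= 4:
        eff = 0.85
    else:
        eff = 0.75
    return single_node_tps * nodes * eff

# Hypothetical 30 TPS per Mac Mini M4 Pro node:
print(cluster_tps(30.0, 2))  # 57.0
```

Note the diminishing returns: doubling from 2 to 4 nodes yields less than double the throughput once network overhead bites.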

Total Cost of Ownership

Local hardware amortized over 3 years at 25% utilization vs cloud API pricing. Where is the breakeven?

Local: Cost per 1M Tokens

Hardware amortized over 3 years, 25% utilization, electricity included

Cloud: API Pricing (per 1M output tokens)

Pay-per-use, no hardware investment, scales instantly

Breakeven Analysis: Tokens to Pay Off Hardware

How many million output tokens until local hardware costs less than each cloud tier. Lower = faster payoff.
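The breakeven point follows directly from the cost difference per million tokens. A sketch with hypothetical figures (the $2,000 hardware price and per-million-token costs below are illustrative, not the chart's actual numbers):

```python
def breakeven_m_tokens(hardware_cost: float, local_per_m: float,
                       cloud_per_m: float) -> float:
    """Million output tokens generated before local hardware pays for
    itself relative to a given cloud tier."""
    savings_per_m = cloud_per_m - local_per_m
    if savings_per_m <= 0:
        return float("inf")  # cloud is cheaper per token; no payoff ever
    return hardware_cost / savings_per_m

# Hypothetical: $2,000 machine, $0.50/1M local marginal cost, $15/1M cloud tier
print(round(breakeven_m_tokens(2000, 0.50, 15.00), 1))  # 137.9
```

The same function also shows why cheap cloud tiers may never break even: if the cloud price drops below your local marginal cost, the payoff is infinite.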

What You Need to Know

Four critical facts about running LLMs locally on Apple Silicon.

💾

More RAM Does Not Mean Faster

RAM capacity determines the largest model you can run, not how fast it runs. A 70B model on a 192GB Mac runs at the same TPS as on a 512GB Mac (same bandwidth chip). RAM is about fit, not speed.

🧩

MoE Architecture Advantage

Mixture-of-Experts models like qwen3-coder activate only a fraction of parameters per token (8B of 30B). This means the effective bandwidth cost per token is much lower, giving 3x the TPS of a dense model at the same parameter count.
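The effective read size can be back-figured from the document's own numbers. A rough sketch (the ~0.65 bytes/param figure is inferred from 8B active params ≈ 5.2 GB at Q4; it is an assumption, not a quantization spec):

```python
def effective_size_gb(active_params_billions: float,
                      bytes_per_param: float = 0.65) -> float:
    """Per-token read cost for an MoE model: only the active experts'
    weights are touched, at roughly 0.65 bytes/param for Q4."""
    return active_params_billions * bytes_per_param

print(round(effective_size_gb(8), 1))  # 5.2 GB read per token, vs 15.3 GB dense
```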

🍎

MLX is 30-50% Faster

Apple's MLX framework is optimized for Apple Silicon's unified memory architecture. Compared to Ollama/llama.cpp, MLX typically achieves 30-50% higher TPS on the same hardware by better utilizing the memory subsystem.

🌡

Thermal Throttling Matters

MacBook Pros throttle under sustained load, dropping 10-25% performance after 5-10 minutes. Mac Studio and Mac Pro have superior cooling. For 24/7 inference servers, desktop form factors are strongly preferred.

Which Hardware Should You Buy?

Answer four questions and get a personalized recommendation based on your budget, team size, model needs, and portability requirements.

