Memory Bandwidth is the Bottleneck
Understanding what drives local LLM performance, and choosing the right hardware for your workload.
In Phase 1, we benchmarked five models across 21 coding prompts. The best local model (qwen3-coder at 48.8 tok/s) delivered solid quality for free, but cloud models ran 3.5x faster. Why? And can better hardware close that gap?
Local LLM inference is memory-bound, not compute-bound. Generating each token requires streaming the model's weights (for MoE models, just the active experts) from memory once, so the speed of that read sets your throughput ceiling.
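That ceiling is simple arithmetic. The sketch below is a back-of-envelope version, using the 400 GB/s peak unified-memory bandwidth Apple quotes for the M2 Max and the ~5.2 GB active-weight figure from this section; the efficiency number is implied by the measured result, not a spec.

```python
# Throughput ceiling: each generated token streams the model's (active)
# weights through the memory bus once.

def tps_ceiling(bandwidth_gb_s: float, bytes_read_gb: float) -> float:
    """Upper bound on tokens/sec: memory bandwidth / bytes read per token."""
    return bandwidth_gb_s / bytes_read_gb

# M2 Max: 400 GB/s unified memory; qwen3-coder reads ~5.2 GB of active
# expert weights per token.
ceiling = tps_ceiling(400, 5.2)
print(f"theoretical ceiling: {ceiling:.1f} tok/s")   # ~76.9 tok/s

# The measured 48.8 tok/s implies ~63% efficiency against that ceiling.
print(f"implied efficiency: {48.8 / ceiling:.0%}")
```

No amount of extra compute raises that ceiling; only more bandwidth or fewer bytes read per token (smaller quantization, fewer active parameters) does.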
Adjust bandwidth, model size, and efficiency to project tokens-per-second for any hardware + model combination.
Six hardware configurations compared. Projected TPS based on the bandwidth formula with qwen3-coder (15.3 GB Q4, MoE ~5.2 GB active).
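The projection formula can be sketched directly. The bandwidths below are Apple's published peak figures for each chip; the 0.63 efficiency factor is back-fitted from the M2 Max measurement (48.8 tok/s), and this hardware list is illustrative rather than necessarily the six configurations in the chart.

```python
# Projected TPS = efficiency * bandwidth / active bytes per token.

ACTIVE_GB = 5.2      # qwen3-coder active-expert read per token (Q4)
EFFICIENCY = 0.63    # measured / theoretical on the M2 Max

bandwidth_gb_s = {
    "M4 Pro":   273,
    "M2 Max":   400,
    "M4 Max":   546,
    "M2 Ultra": 800,
    "M3 Ultra": 819,
}

for chip, bw in bandwidth_gb_s.items():
    tps = EFFICIENCY * bw / ACTIVE_GB
    print(f"{chip:9s} {bw:4d} GB/s -> ~{tps:5.1f} tok/s")
```

Note how the projection for the M2 Max lands back on ~48 tok/s, the measured value, which is exactly what calibrating the efficiency factor guarantees; the other rows scale linearly with bandwidth.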
Projected tokens-per-second for qwen3-coder on each hardware configuration. The M2 Max bar shows actual benchmark data.
Select a model and quantization to see which hardware configs can fit it in RAM, and the projected throughput.
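The fit check itself reduces to a few additions. The KV-cache and OS-headroom allowances below are rough assumptions for illustration, not macOS guarantees.

```python
# A model "fits" if quantized weights + KV cache + OS headroom
# stay under total RAM.

def fits_in_ram(model_gb: float, ram_gb: float,
                kv_cache_gb: float = 4.0, os_reserve_gb: float = 8.0) -> bool:
    return model_gb + kv_cache_gb + os_reserve_gb <= ram_gb

# qwen3-coder at Q4 is ~15.3 GB (from above); a 70B model at Q4 is ~40 GB.
print(fits_in_ram(15.3, 32))   # True: fits on a 32 GB machine
print(fits_in_ram(40.0, 32))   # False: a Q4 70B needs 64 GB or more
```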
exo lets you pool multiple Macs into a single inference cluster. Add Mac Mini nodes to see aggregate performance projections.
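A rough model of pipeline-parallel pooling, which is how exo shards layers across nodes: each token must stream through every shard in sequence, so per-token time is the sum of each node's read time. This sketch ignores network overhead, which costs additional throughput in practice; the shard split and bandwidth figures are illustrative.

```python
# Single-stream ceiling for a pipeline-parallel cluster: per-token time
# is the SUM of each node's shard-read time.

def cluster_tps(shard_gb: list[float], bandwidth_gb_s: list[float]) -> float:
    per_token_s = sum(s / b for s, b in zip(shard_gb, bandwidth_gb_s))
    return 1.0 / per_token_s

# Illustrative: 5.2 GB of active weights split evenly between an
# M2 Max (400 GB/s) and a base Mac Mini M4 (120 GB/s).
print(f"~{cluster_tps([2.6, 2.6], [400, 120]):.1f} tok/s ceiling")
```

The takeaway: adding a slower node lowers the single-stream ceiling (here ~35 tok/s vs ~77 tok/s on the M2 Max alone), so pooling mainly buys RAM capacity to fit bigger models, not speed on models that already fit.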
Local hardware amortized over 3 years at 25% utilization vs cloud API pricing. Where is the breakeven?
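The amortization works out as follows; every price in this sketch (hardware cost, power draw, electricity rate) is a placeholder assumption, not data from the comparison.

```python
# Effective local cost per million tokens: hardware + electricity,
# amortized over 3 years at 25% utilization.

HOURS_3Y = 3 * 365 * 24
UTILIZATION = 0.25

def local_cost_per_mtok(hardware_usd: float, tps: float,
                        power_watts: float = 100,
                        usd_per_kwh: float = 0.15) -> float:
    active_hours = HOURS_3Y * UTILIZATION
    tokens = tps * 3600 * active_hours
    energy_usd = power_watts / 1000 * active_hours * usd_per_kwh
    return (hardware_usd + energy_usd) / (tokens / 1e6)

# e.g. a hypothetical $3,499 machine at the measured 48.8 tok/s:
print(f"${local_cost_per_mtok(3499, 48.8):.2f} per Mtok")
```

Breakeven is wherever that figure crosses your cloud API's blended $/Mtok; higher utilization or faster hardware pushes the local cost down.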
Four critical facts about running LLMs locally on Apple Silicon.
RAM capacity determines the largest model you can run, not how fast it runs. A 70B model runs at the same TPS on a 192GB Mac as on a 512GB Mac, provided the chips have the same memory bandwidth. RAM is about fit, not speed.
Mixture-of-Experts models like qwen3-coder activate only a fraction of parameters per token (8B of 30B). This means the effective bandwidth cost per token is much lower, giving 3x the TPS of a dense model at the same parameter count.
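That speedup falls straight out of the bandwidth rule, using the Q4 sizes quoted earlier in this section:

```python
# Per-token read cost scales with ACTIVE bytes, not total bytes, so an
# MoE model's speedup over an equal-size dense model is total/active.

def speedup_vs_dense(total_gb: float, active_gb: float) -> float:
    return total_gb / active_gb

# qwen3-coder: 15.3 GB of weights at Q4, ~5.2 GB of active experts/token.
print(f"~{speedup_vs_dense(15.3, 5.2):.1f}x over an equal-size dense model")
```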
Apple's MLX framework is optimized for Apple Silicon's unified memory architecture. Compared to Ollama/llama.cpp, MLX typically achieves 30-50% higher TPS on the same hardware by better utilizing the memory subsystem.
MacBook Pros throttle under sustained load, dropping 10-25% performance after 5-10 minutes. Mac Studio and Mac Pro have superior cooling. For 24/7 inference servers, desktop form factors are strongly preferred.
Answer four questions and get a personalized recommendation based on your budget, team size, model needs, and portability requirements.