
On-Premise LLM
Infrastructure Guide

Memory Bandwidth is the Bottleneck

Understanding what drives local LLM performance, and choosing the right hardware for your workload.

- Benchmarks: 147 real tests
- Hardware: M2 Max 32GB
- Models: 7 tested
- Framework: Ollama + llama.cpp
From Benchmarks to Hardware

Phase 1 revealed the gap. Phase 2 explains why — and how to close it.

In Phase 1, we benchmarked 5 models across 21 coding prompts. The best local model (qwen3-coder at 48.8 tok/s) delivered solid quality for free — but cloud models ran 3.5x faster. Why? And can better hardware close that gap?

- Best local TPS: 48.8
- Best cloud TPS: 169.7
- Speed gap: 3.5x
- M2 Max bandwidth: 400 GB/s
The answer is memory bandwidth. Scroll down to see the formula, the math, and the hardware that can push local TPS past 90.

Why Memory Bandwidth Determines TPS

Local LLM inference is memory-bound, not compute-bound. Every token generated requires reading the entire model from memory once. The speed of that read is your throughput ceiling.

The Fundamental Formula

TPS = (Memory Bandwidth [GB/s] / Model Size [GB]) × Efficiency
Efficiency factor: 0.5-0.7 is typical for Ollama/llama.cpp on Apple Silicon.
Memory Bandwidth = Pipe Diameter
400 GB/s on M2 Max
How fast data can flow from unified memory to the GPU/CPU. Wider pipe = more data per second = more tokens.
Model Size = Water Volume Per Token
15.3 GB for qwen3-coder Q4
Each token requires reading the full model weights. Bigger model = more data per token = slower generation.
VALIDATION AGAINST REAL DATA
Dense prediction: 400 / 15.3 × 0.6 = 15.7 TPS. Actual: 48.8 TPS.
The prediction undershoots because qwen3-coder uses an MoE architecture: only 8B of its 30B parameters are active per token, so the effective read per token is ~5.2 GB, not 15.3 GB. Rerunning the formula with that size gives 400 / 5.2 × 0.6 = 46.2 TPS, very close to the measured 48.8 TPS.
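The formula is simple enough to check in a few lines. A minimal sketch in Python (the function name and rounding are ours; the bandwidth, model sizes, and efficiency come from the numbers above):

```python
def projected_tps(bandwidth_gbs: float, model_size_gb: float,
                  efficiency: float = 0.6) -> float:
    """Project tokens/second: each token requires one full read of the
    model weights, so throughput = bandwidth / read size, discounted by
    a framework efficiency factor (0.5-0.7 for Ollama/llama.cpp)."""
    return bandwidth_gbs / model_size_gb * efficiency

# Dense assumption: the full 15.3 GB is read per token.
print(round(projected_tps(400, 15.3), 1))  # 15.7
# MoE: only the ~5.2 GB of active experts is read per token.
print(round(projected_tps(400, 5.2), 1))   # 46.2
```

The MoE projection lands within 5% of the measured 48.8 TPS, which is about as close as a one-line model gets.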

Interactive TPS Projector

Adjust bandwidth, model size, and efficiency to project tokens-per-second for any hardware + model combination.

Default inputs:
- Memory bandwidth: 400 GB/s
- Model size: 15.0 GB
- Efficiency factor: 0.60

M2 Max benchmark reference:
- qwen3-coder (MoE, ~5.2 GB active): 48.8 TPS actual
- qwen2.5-coder:14b (7.4 GB Q4): 15.6 TPS actual
- deepseek-r1:14b (7.4 GB Q4): 14.6 TPS actual
- glm-4.7-flash (~5 GB Q4): 10.2 TPS actual

Projected TPS: 400 GB/s / 15.0 GB × 0.60 = 16.0 tokens/second

Apple Silicon Lineup

Six hardware configurations compared. Projected TPS based on the bandwidth formula with qwen3-coder (15.3 GB Q4, MoE ~5.2 GB active).

TPS Across Hardware

Projected tokens-per-second for qwen3-coder on each hardware configuration. The M2 Max bar shows actual benchmark data.

qwen3-coder Projected TPS by Hardware

Horizontal bars. Green = actual measured data. Blue = projected from bandwidth formula.

Can Your Hardware Run It?

Select a model and quantization to see which hardware configs can fit it in RAM, and the projected throughput.

RAM requirement includes ~2-4 GB overhead for KV cache and runtime. Models need to fit entirely in RAM for acceptable performance (no swap).
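The fit check reduces to a single comparison. A hedged sketch (the 3 GB default overhead is a midpoint of the 2-4 GB range stated above; the function name and example sizes are ours):

```python
def fits_in_ram(model_size_gb: float, ram_gb: float,
                overhead_gb: float = 3.0) -> bool:
    """Weights plus KV-cache/runtime overhead must fit entirely in
    unified memory; swapping makes generation unusably slow."""
    return model_size_gb + overhead_gb <= ram_gb

print(fits_in_ram(15.3, 32.0))  # True:  qwen3-coder Q4 on a 32 GB M2 Max
print(fits_in_ram(40.0, 32.0))  # False: ~40 GB of Q4 weights needs a bigger machine
```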

exo Cluster Estimator

exo lets you pool multiple Macs into a single inference cluster. Add Mac Mini nodes to see aggregate performance projections.

Default: 2 Mac Mini M4 Pro nodes.
Scaling efficiency: 95% for 2 nodes, 85% for 3-4, 75% for 5+ (network overhead). Based on exo/Thunderbolt clustering benchmarks.
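Those efficiency tiers fold into a small estimator. A sketch under the stated assumptions (the per-node TPS is an input you supply; `cluster_tps` is our name, not part of exo's API):

```python
def cluster_tps(single_node_tps: float, nodes: int) -> float:
    """Aggregate throughput for an exo cluster, applying the scaling
    efficiencies quoted above: 95% for 2 nodes, 85% for 3-4, 75% for 5+."""
    if nodes <= 1:
        eff = 1.00
    elif nodes == 2:
        eff = 0.95
    elif nodes <= 4:
        eff = 0.85
    else:
        eff = 0.75
    return single_node_tps * nodes * eff

# Hypothetical 30 TPS per Mac Mini M4 Pro node:
print(cluster_tps(30.0, 2))  # 57.0
```

Note the diminishing returns: doubling from 2 to 4 nodes yields less than double the throughput once network overhead bites.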

Total Cost of Ownership

Local hardware amortized over 3 years at 25% utilization vs cloud API pricing. Where is the breakeven?

Local: Cost per 1M Tokens

Hardware amortized over 3 years, 25% utilization, electricity included

Cloud: API Pricing (per 1M output tokens)

Pay-per-use, no hardware investment, scales instantly

Breakeven Analysis: Tokens to Pay Off Hardware

How many million output tokens until local hardware costs less than each cloud tier. Lower = faster payoff.
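The breakeven point follows directly from the cost difference per million tokens. A sketch with hypothetical figures (the $2,000 hardware price and per-million-token costs below are illustrative, not the chart's actual numbers):

```python
def breakeven_m_tokens(hardware_cost: float, local_per_m: float,
                       cloud_per_m: float) -> float:
    """Million output tokens generated before local hardware pays for
    itself relative to a given cloud tier."""
    savings_per_m = cloud_per_m - local_per_m
    if savings_per_m <= 0:
        return float("inf")  # cloud is cheaper per token; no payoff ever
    return hardware_cost / savings_per_m

# Hypothetical: $2,000 machine, $0.50/1M local marginal cost, $15/1M cloud tier
print(round(breakeven_m_tokens(2000, 0.50, 15.00), 1))  # 137.9
```

The same function also shows why cheap cloud tiers may never break even: if the cloud price drops below your local marginal cost, the payoff is infinite.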

What You Need to Know

Four critical facts about running LLMs locally on Apple Silicon.

💾

More RAM Does Not Mean Faster

RAM capacity determines the largest model you can run, not how fast it runs. A 70B model on a 192GB Mac runs at the same TPS as on a 512GB Mac (same bandwidth chip). RAM is about fit, not speed.

🧩

MoE Architecture Advantage

Mixture-of-Experts models like qwen3-coder activate only a fraction of parameters per token (8B of 30B). This means the effective bandwidth cost per token is much lower, giving 3x the TPS of a dense model at the same parameter count.
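The effective read size can be back-figured from the document's own numbers. A rough sketch (the ~0.65 bytes/param figure is inferred from 8B active params ≈ 5.2 GB at Q4; it is an assumption, not a quantization spec):

```python
def effective_size_gb(active_params_billions: float,
                      bytes_per_param: float = 0.65) -> float:
    """Per-token read cost for an MoE model: only the active experts'
    weights are touched, at roughly 0.65 bytes/param for Q4."""
    return active_params_billions * bytes_per_param

print(round(effective_size_gb(8), 1))  # 5.2 GB read per token, vs 15.3 GB dense
```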

🍎

MLX is 30-50% Faster

Apple's MLX framework is optimized for Apple Silicon's unified memory architecture. Compared to Ollama/llama.cpp, MLX typically achieves 30-50% higher TPS on the same hardware by better utilizing the memory subsystem.

🌡

Thermal Throttling Matters

MacBook Pros throttle under sustained load, dropping 10-25% performance after 5-10 minutes. Mac Studio and Mac Pro have superior cooling. For 24/7 inference servers, desktop form factors are strongly preferred.

Which Hardware Should You Buy?

Answer four questions and get a personalized recommendation based on your budget, team size, model needs, and portability requirements.

