CPU Inference #1: The Three Regimes

Inference throughput is not a scalar.

Everyone benchmarks LLM inference with a single number: "X tokens per second." I used to do the same. Then I built a CPU inference engine for a 30B parameter model from scratch — in Rust, with hand-written assembly — and that number stopped making any sense.

This is not about prompting. Not about model selection. This is about resource allocation at inference time — where the real bottleneck shifts depending on what the hardware is actually doing. I identified three distinct physical regimes. Same engine, same binary, same weights — three different sets of laws governing throughput.

Here's what I measured.


The model is Qwen3-30B-A3B. A Mixture-of-Experts architecture. 48 transformer layers. 32 attention heads, 4 KV heads (GQA 8:1). Hidden dimension 2048. It's a real production model, and I run it on two machines: an AMD Ryzen 9 7900 with AVX-512 (12 cores, Zen 4) and an Apple M3 Ultra with NEON (28 cores). No GPU. Just CPU cores, cache hierarchies, and memory buses.

When I benchmarked the engine with Q4-quantized weights on the Ryzen, the results looked contradictory:

Prompt length   Prefill tok/s   Decode tok/s
100 tokens      191             21.6
1,000 tokens    247             20.8
10,000 tokens   134             14.8

Prefill gets faster when you multiply the prompt by 10 — then crashes when you do it again. Decode slowly degrades, almost indifferent to the wild swings in prefill. Same model, same code, same machine.

On the M3 Ultra, the same pattern — even more pronounced:

Prompt length   Prefill tok/s   Decode tok/s
100 tokens      224             19.5
1,000 tokens    316             19.0
10,000 tokens   145             13.8

[Figure: prefill vs. decode throughput by prompt length]

These numbers hide three completely different realities.

Prefill is a compute problem. Your prompt has S tokens. Each layer performs three matrix multiplications (Q, K, V projections), an attention computation, and a feed-forward network. The matrices are big: [S, 2048] × [2048, 4096] for the Q projection alone. That's about 8 million multiply-adds per token for that one projection, times S tokens, times 48 layers. The CPU is doing real arithmetic work. All cores are busy. IPC is high.

Decode is a memory problem. You're generating one token at a time. Each layer still runs the same operations, but now S=1. A matrix multiplication becomes a matrix-vector product. [1, 2048] × [2048, 4096] is just 8 million multiply-adds, and you only get to reuse each weight once. Every active weight (attention plus whichever experts the router picks) must stream through the CPU once per token. For this model in Q4 quantization, that's roughly 500 MB of DRAM traffic per token generated. The CPU isn't computing. It's waiting for data.

Same model. Same code path. Two completely different bottlenecks. This is the first regime boundary: Streaming Prefill is compute-saturated, Memory-Bound Decode is bandwidth-starved. Optimize for one, you miss the other. Optimize for both at once, you optimize for neither.
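
To make that boundary concrete, here is a back-of-the-envelope sketch (plain arithmetic, not engine code) of the arithmetic intensity of the [S, 2048] × [2048, 4096] projection above: multiply-adds per byte of weight traffic, assuming roughly 0.5 bytes per weight for Q4.

    // Back-of-the-envelope roofline arithmetic for one [2048, 4096] projection,
    // assuming ~0.5 bytes per weight for Q4 (real Q4 formats add scale metadata).
    fn main() {
        let (hidden, out) = (2048u64, 4096u64);
        let weight_bytes = (hidden * out) as f64 * 0.5;

        for s in [1u64, 100, 1_000, 10_000] {
            let macs = (s * hidden * out) as f64; // multiply-adds for [S,2048] x [2048,4096]
            println!(
                "S = {:>6}: {:>8.2} GMAC, {:>7.1} multiply-adds per weight byte",
                s,
                macs / 1e9,
                macs / weight_bytes
            );
        }
        // S = 1 gets two multiply-adds per weight byte streamed: paced by DRAM.
        // S = 1,000 reuses each weight a thousand times: paced by the ALUs.
    }

At S = 1 every byte of weight traffic buys two multiply-adds; at S = 1,000 it buys two thousand. That ratio is the whole prefill/decode split in one number.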

I learned this the hard way. Every optimization that helped prefill was irrelevant to decode. Every optimization that helped decode was irrelevant to prefill. Some optimizations helped one and actively hurt the other.


But it gets worse. Even within a single phase, performance isn't constant.

I profiled every layer of the model during prefill with a 1954-token prompt. Here's what I saw:

Layer 0:  attn=6442ms  ffn=552ms   total=7004ms
Layer 8:  attn=3203ms  ffn=520ms   total=3726ms
Layer 17: attn=3837ms  ffn=400ms   total=4240ms
Layer 29: attn=6448ms  ffn=561ms   total=7015ms
Layer 47: attn=6187ms  ffn=567ms   total=6759ms

Layers 8, 17, 27, 36, 45 are nearly 2x faster than their neighbors. Same architecture, same code, same weight format, but different MoE routing patterns. In some layers the router concentrates the batch's tokens onto fewer distinct experts, so less expert weight data has to stream and less work gets done. The execution time per layer varies by a factor of two within the same inference pass.
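
A toy illustration of that mechanism, with the expert count and top-k as placeholder assumptions rather than profiled values: what matters for a prefill batch is how many distinct experts a layer's router touches, because that determines how much expert weight data has to stream.

    use std::collections::HashSet;

    // Toy model of per-layer routing: the number of *distinct* experts a prefill
    // batch touches determines how much expert weight data must stream for that
    // layer. NUM_EXPERTS and TOP_K are placeholder assumptions for this sketch.
    const NUM_EXPERTS: usize = 128;
    const TOP_K: usize = 8;

    /// Count distinct experts hit by one layer, given each token's routed expert IDs.
    fn distinct_experts(routing: &[[usize; TOP_K]]) -> usize {
        routing.iter().flatten().copied().collect::<HashSet<_>>().len()
    }

    fn main() {
        // Layer A: tokens scatter across many experts -> more weight traffic.
        let scattered: Vec<[usize; TOP_K]> = (0..64)
            .map(|t| std::array::from_fn(|k| (t * 13 + k * 17) % NUM_EXPERTS))
            .collect();
        // Layer B: tokens pile onto a small expert subset -> less weight traffic.
        let concentrated: Vec<[usize; TOP_K]> = (0..64)
            .map(|_| std::array::from_fn(|k| k))
            .collect();

        println!("scattered layer touches    {} distinct experts", distinct_experts(&scattered));
        println!("concentrated layer touches {} distinct experts", distinct_experts(&concentrated));
    }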

Then I measured attention kernel throughput in isolation, varying the KV-cache size on a single core:

10 positions    (0.0 MB KV in L1)   →  92 GFLOP/s
100 positions   (0.4 MB KV in L2)   →  84 GFLOP/s
1000 positions  (3.9 MB KV in L3)   →  91 GFLOP/s
10000 positions (39 MB KV in DRAM)   →  56 GFLOP/s
17000 positions (66 MB KV in DRAM)   →  45 GFLOP/s

Same kernel. Same instruction stream. Same core. 50% throughput variation depending on where the KV cache happens to live in the memory hierarchy. And the relationship isn't even monotonic — L3 is faster than L2, because the hardware prefetcher excels at sequential L3 streaming but L2 capacity forces evictions.
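
For reference, the MB column is just this arithmetic, under the assumption of an fp32 KV cache with 4 KV heads of head dimension 128 (about 4 KB per position per layer), checked against Zen 4's 1 MB per-core L2 and 32 MB per-CCD L3.

    // KV-cache footprint per layer, assuming an fp32 KV cache with 4 KV heads of
    // head_dim 128 (assumptions for this sketch): about 4 KB per position per layer.
    const KV_HEADS: usize = 4;
    const HEAD_DIM: usize = 128;
    const BYTES_PER_ELEM: usize = 4; // fp32; fp16 or quantized KV would shrink this

    fn kv_bytes_per_layer(positions: usize) -> usize {
        positions * 2 * KV_HEADS * HEAD_DIM * BYTES_PER_ELEM // K and V
    }

    fn main() {
        // Zen 4 capacities: 1 MB L2 per core, 32 MB L3 per CCD.
        for positions in [10usize, 100, 1_000, 10_000, 17_000] {
            let mb = kv_bytes_per_layer(positions) as f64 / (1u64 << 20) as f64;
            let home = if mb <= 1.0 { "L1/L2" } else if mb <= 32.0 { "L3" } else { "DRAM" };
            println!("{:>6} positions: {:>5.1} MB of KV -> {}", positions, mb, home);
        }
    }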


Then you add threads. And the physics changes again.

I measured prefill scaling on the Ryzen 9 (12 physical cores, 24 logical with SMT, split across two Zen 4 chiplets):

1 thread:   29.7 tok/s     (1.00x)
4 threads:  95.8 tok/s     (3.23x, 81% efficiency)
8 threads:  151.2 tok/s    (5.09x, 64% efficiency)
12 threads: 141.8 tok/s    (4.77x — SLOWER than 8)
16 threads: 147.7 tok/s    (4.97x — barely recovers)

Scaling beyond 8 threads hurts prefill. The Zen 4 architecture splits the 12 cores across two chiplets (CCDs), each with its own 32 MB L3 cache. Crossing the CCD boundary at 12 threads introduces inter-chiplet latency. Then at 16 threads, SMT kicks in: the L2 cache is 1 MB per core, shared between SMT siblings. Each attention layer streams 3.8 MB of KV data — nearly 4x what the L2 can hold. Adding a second logical thread per core halves the effective L2 to 512 KB/thread, and the hardware prefetcher collapses.

But for decode, the picture is reversed:

8 threads:  10.2 tok/s
12 threads: 10.5 tok/s
16 threads: 11.1 tok/s (+8.8% vs 8T)
24 threads: 9.0 tok/s  (oversubscription collapse)

SMT helps decode by 9%. Same machine. Opposite scaling behavior. Because decode is latency-bound (waiting for DRAM), and SMT fills pipeline bubbles while the other thread waits. Prefill is bandwidth-bound (streaming through L2), and SMT competes for that bandwidth.
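
The practical consequence is that the thread pool should be sized per phase, not per process. Here's a minimal sketch with rayon, where the specific counts (8 and 16) are just what this particular Ryzen preferred, not universal constants:

    use rayon::ThreadPoolBuilder;

    // Sketch: one pool per phase, sized from the measurements above. The counts
    // are machine-specific tunables for my Ryzen 9 7900, not universal constants.
    fn main() {
        let prefill_pool = ThreadPoolBuilder::new()
            .num_threads(8) // physical cores only: SMT siblings fight over L2 bandwidth
            .build()
            .expect("prefill pool");

        let decode_pool = ThreadPoolBuilder::new()
            .num_threads(16) // SMT on: the second sibling hides DRAM latency
            .build()
            .expect("decode pool");

        prefill_pool.install(|| {
            // run_prefill(&prompt) would go here: compute / L2-bandwidth bound
        });
        decode_pool.install(|| {
            // run_decode_step() would go here: DRAM-latency bound
        });
    }

Whether you keep two long-lived pools or resize one is an engine-level choice; the point is that the thread count is a per-phase decision, not a process-wide constant.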


But the real pipeline is not linear.

Everything above assumes a clean sequence: one prefill, then decode. In practice, with tool-augmented models, that never happens.

Consider a concrete scenario. The user asks: "What's the error in this file?" — 20 tokens, plus a system prompt and conversation history of maybe 200 tokens. Total prefill: 220 tokens. The matrices are small. Thread dispatch and synchronization overhead dominate the actual arithmetic. This is Dispatch-Bound Prefill — the third regime, the one that doesn't appear in the benchmarks above because I was measuring with 100+ token prompts. But the hint is already there: at 100 tokens, prefill runs at 191 tok/s on the Ryzen. At 1,000 tokens, it peaks at 247 — 29% faster despite 10x more work. The matrices at 100 tokens are already borderline. Below that, dispatch overhead dominates.
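
The usual mitigation is a work threshold: if the prompt is so short that fork/join synchronization costs more than the parallel arithmetic saves, use fewer threads, or just one. A sketch of that decision, with the per-thread chunk size as an explicit assumption to tune:

    /// Decide how many threads a prefill of `seq_len` tokens deserves.
    /// TOKENS_PER_THREAD is an assumed tuning knob: below it, dispatch and
    /// synchronization overhead outweighs the parallel speedup on the small
    /// [S, 2048] x [2048, 4096] projections.
    fn prefill_threads(seq_len: usize, max_threads: usize) -> usize {
        const TOKENS_PER_THREAD: usize = 64;
        (seq_len / TOKENS_PER_THREAD).clamp(1, max_threads)
    }

    fn main() {
        for s in [20, 220, 1_000, 5_000] {
            println!("prefill of {:>5} tokens -> {} threads", s, prefill_threads(s, 8));
        }
    }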

The model decodes a tool call: 30 tokens. Each token streams the active weights through DRAM. Memory-Bound Decode.

The tool returns 5,000 tokens of file content. Now the engine runs a second prefill, and this time the matrices are large: [5000, 2048] × [2048, 4096]. All cores saturate. The hardware prefetcher streams data through L2 at full bandwidth. Streaming Prefill: a completely different physical regime from the first prefill, triggered within the same inference pass.

Decode resumes. But now the KV cache holds 5,250 positions instead of 220. The attention kernel that was running at 92 GFLOP/s with the small cache is now closer to 56 GFLOP/s — because 5,250 positions push the KV data from L3 into DRAM.

Same binary. Same weights. Same conversation turn. Four distinct phases spanning all three regimes, and the throughput number changes at every phase boundary.

This is why the regime model matters beyond benchmarking. A single tool-augmented inference pass doesn't live in one regime — it traverses all of them. The "tokens per second" you measure depends entirely on which phase you happen to be timing.
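
One way to keep the number honest is to log throughput per phase instead of per request. Here's a minimal sketch of that bookkeeping, with the phase names mirroring the regimes (none of this is tied to my engine's actual internals):

    use std::collections::HashMap;
    use std::time::Instant;

    // One (tokens, seconds) accumulator per regime, so a tool-augmented turn
    // reports three throughput numbers instead of one blended "tok/s".
    #[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
    enum Phase {
        DispatchBoundPrefill,
        StreamingPrefill,
        MemoryBoundDecode,
    }

    #[derive(Default)]
    struct TurnStats {
        per_phase: HashMap<Phase, (u64, f64)>, // (tokens processed, seconds spent)
    }

    impl TurnStats {
        fn record(&mut self, phase: Phase, tokens: u64, started: Instant) {
            let entry = self.per_phase.entry(phase).or_insert((0, 0.0));
            entry.0 += tokens;
            entry.1 += started.elapsed().as_secs_f64();
        }

        fn report(&self) {
            for (phase, (tokens, secs)) in &self.per_phase {
                println!("{:?}: {:.1} tok/s", phase, *tokens as f64 / secs);
            }
        }
    }

    fn main() {
        let mut stats = TurnStats::default();

        let t = Instant::now();
        // ... the 220-token first prefill would run here ...
        stats.record(Phase::DispatchBoundPrefill, 220, t);

        let t = Instant::now();
        // ... the 30-token tool-call decode would run here ...
        stats.record(Phase::MemoryBoundDecode, 30, t);

        stats.report();
    }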


The real lesson isn't about any single optimization. It's that LLM inference on CPU is not one workload. It's at least three distinct physical regimes:

  1. Streaming Prefill (long prompt, big matrices): L2-bandwidth-bound. HW prefetcher matters. SMT hurts. Compute is plentiful. The CPU is busy.

  2. Dispatch-Bound Prefill (short prompt, warm KV cache): overhead-dominated. The work units are so small that thread synchronization costs more than the computation itself. Scaling from 1 to 8 threads gives only 2.35x.

  3. Memory-Bound Decode (one token at a time): DRAM-bandwidth-bound weight streaming. SMT helps. ~535 MB of DRAM traffic per generated token. The CPU is mostly waiting.

Same engine. Same binary. Three different sets of physical laws.

This is not an academic distinction. It has direct engineering consequences: thread pool sizing, memory page strategy, cache partitioning, quantization format — every decision depends on which regime dominates your workload. A chatbot answering short questions lives in Dispatch-Bound Prefill and Memory-Bound Decode. A RAG pipeline processing long documents lives in Streaming Prefill. The optimal configuration for each is different — sometimes opposite.
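
If I had to compress the engineering consequence into code, it would be a classifier along these lines. The 256-token boundary and the thread counts below are assumptions to tune per machine, not measured truths:

    // Sketch: map the work in front of the engine to a regime, and the regime to a
    // configuration. The thresholds and thread counts are per-machine tunables.
    #[derive(Debug)]
    enum Regime {
        DispatchBoundPrefill, // short prompt: synchronization overhead dominates
        StreamingPrefill,     // long prompt: big GEMMs, L2-bandwidth bound
        MemoryBoundDecode,    // one token: DRAM-bandwidth-bound weight streaming
    }

    struct RegimeConfig {
        threads: usize,
        use_smt: bool,
    }

    fn classify(new_tokens: usize) -> Regime {
        match new_tokens {
            1 => Regime::MemoryBoundDecode,
            n if n < 256 => Regime::DispatchBoundPrefill,
            _ => Regime::StreamingPrefill,
        }
    }

    fn config_for(regime: &Regime) -> RegimeConfig {
        match regime {
            Regime::DispatchBoundPrefill => RegimeConfig { threads: 2, use_smt: false },
            Regime::StreamingPrefill => RegimeConfig { threads: 8, use_smt: false },
            Regime::MemoryBoundDecode => RegimeConfig { threads: 16, use_smt: true },
        }
    }

    fn main() {
        for n in [1usize, 20, 220, 5_000] {
            let regime = classify(n);
            let cfg = config_for(&regime);
            println!("{:>5} new tokens -> {:?} ({} threads, SMT {})", n, regime, cfg.threads, cfg.use_smt);
        }
    }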

When someone tells you their inference engine runs at "X tokens per second" — ask them: at what sequence length? During which phase? On how many threads? With what cache state? Cold or warm?

Inference throughput is not a number. It's a function of regime.

And if you optimize without knowing which regime you're in, you're optimizing blind.