CPU Inference #1: The three regimes

Inference throughput is not a scalar.

Everyone benchmarks LLM inference with a single number: "X tokens per second." I used to do the same. Then I built a CPU inference engine for a 30B parameter model from scratch (in Rust, with hand-written assembly) and that number stopped making any sense.

This is not about prompting. Not about model selection. This is about resource allocation at inference time, where the real bottleneck shifts depending on what the hardware is actually doing. I identified three distinct physical regimes. Same engine, same weights. Three different sets of laws governing throughput.

Here's what I measured.

The three inference regimes


The model is Qwen3-30B-A3B. A Mixture-of-Experts architecture. 48 transformer layers. 32 attention heads, 4 KV heads (GQA 8:1). Hidden dimension 2048. It's a real production model, and I run it on two machines: an AMD Ryzen 9 7900 with AVX-512 (12 cores, Zen 4) and an Apple M3 Ultra with NEON (28 cores). No GPU. Just CPU cores, cache hierarchies, and memory buses.

When I benchmarked the engine in Q4, the results didn't form any straight line:

                  Ryzen 9 7900            M3 Ultra
Prompt tokens    Prefill  Decode     Prefill  Decode    (all tok/s)
13                 47.4    21.7        34.9    28.1
36                 68.6    21.6       103.9    28.0
81                 76.0    21.6       171.4    27.9
182                87.9    21.5       244.4    26.2
461                88.5    22.4       342.3    28.1
1,096              89.8    21.6       347.7    26.5
1,795              87.2    20.8       309.2    25.2
4,239              79.5    18.3       223.5    21.7
8,134              69.0    15.2       152.2    17.6

Prefill climbs, plateaus between 500 and 1,100 tokens, then declines. The M3 Ultra peaks at 348 tok/s (nearly 4x the Ryzen) thanks to its 28 cores and unified memory bandwidth. Decode stays flat around 22 tok/s (Ryzen) and 28 tok/s (M3 Ultra) across most of the range, then drops past 2,000 tokens when the KV cache spills to DRAM.

Prefill vs Decode throughput by prompt length

Same model. Same code. Two curves that look nothing alike. These numbers hide three completely different realities.

Prefill is a compute problem. Your prompt has S tokens. Each layer performs three matrix multiplications (Q, K, V projections), an attention computation, and a feed-forward network. The matrices are big: [S, 2048] × [2048, 4096] for the Q projection alone. That's roughly 8 million multiply-adds per token, per layer, for that one projection. Times S tokens. Times 48 layers. The CPU is doing real arithmetic work. All cores are busy. IPC is high.
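The arithmetic is worth sanity-checking, because the whole regime argument rests on it. A minimal sketch (function name and the standalone harness are mine; the dimensions come from the model description above):

```rust
/// Multiply-adds (MACs) needed for an [s, k] x [k, n] matrix product.
fn matmul_macs(s: u64, k: u64, n: u64) -> u64 {
    s * k * n
}

fn main() {
    // One token's Q projection: [1, 2048] x [2048, 4096].
    let per_token = matmul_macs(1, 2048, 4096);
    // A prompt multiplies that by S, per layer, for Q alone.
    let prompt_1k = matmul_macs(1000, 2048, 4096);
    println!("Q MACs per token: {per_token}, per 1,000-token prompt: {prompt_1k}");
}
```

The key point for what follows: prefill work grows with S while the weights are loaded once, so arithmetic per byte of weights read grows with S too.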

Decode is a memory problem. You're generating one token at a time. Each layer still runs the same operations, but now S=1. A matrix multiplication becomes a matrix-vector product. [1, 2048] × [2048, 4096] is still 8 million multiply-adds, but each weight is used exactly once: there's no reuse to amortize the load. Every active weight must stream through the CPU once per token. For this model in Q4 quantization, that's roughly 500 MB of DRAM traffic per token generated. The CPU isn't computing. It's waiting for data.
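That framing gives a simple roofline bound on decode. A sketch, using the ~500 MB/token figure from above and an assumed ~60 GB/s of DRAM bandwidth (illustrative, not a measurement; real decode lands well below the bound because effective bandwidth is lower):

```rust
/// Bandwidth ceiling on decode throughput: if each generated token
/// must stream `mb_per_token` MB of weights from DRAM, tok/s can't
/// exceed bandwidth divided by that traffic.
fn decode_ceiling_toks(dram_gb_per_s: f64, mb_per_token: f64) -> f64 {
    dram_gb_per_s * 1000.0 / mb_per_token
}

fn main() {
    // Assumed numbers: ~60 GB/s DRAM, ~500 MB streamed per token.
    let ceiling = decode_ceiling_toks(60.0, 500.0);
    println!("bandwidth ceiling: {ceiling:.0} tok/s (upper bound only)");
}
```

The useful property of the bound is its shape: halve the bytes per token and the ceiling doubles, which is exactly the lever quantization pulls later in this article.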

Same model. Same code path. Two completely different bottlenecks. This is the first regime boundary: Streaming Prefill is compute-saturated, Memory-Bound Decode is bandwidth-starved. Optimize for one, you miss the other. Optimize for both at once, you optimize for neither.

I learned this the hard way. Every optimization that helped prefill was irrelevant to decode, and vice versa. Some helped one and actively hurt the other.


But it gets worse. Even within a single phase, performance isn't constant.

I profiled every layer of the model during prefill with a 1,954-token prompt. Here's what I saw:

Layer 0:  attn=6442ms  ffn=552ms   total=7004ms
Layer 8:  attn=3203ms  ffn=520ms   total=3726ms
Layer 17: attn=3837ms  ffn=400ms   total=4240ms
Layer 29: attn=6448ms  ffn=561ms   total=7015ms
Layer 47: attn=6187ms  ffn=567ms   total=6759ms

Layers 8, 17, 27, 36, 45 are nearly 2x faster than their neighbors. Same architecture, same weight format, but different MoE routing patterns. Some layers activate sparser sets of experts, producing less work. Execution time per layer varies by a factor of two within a single inference pass.
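The measurement itself needs nothing exotic. A minimal version of the per-layer timing harness (the `step` closure stands in for the engine's real layer computation, which is not shown here):

```rust
use std::time::Instant;

/// Times each layer of a forward pass; `step` stands in for the real
/// per-layer computation (hypothetical -- the engine's internals differ).
fn profile_layers<F: FnMut(usize)>(n_layers: usize, mut step: F) -> Vec<u128> {
    (0..n_layers)
        .map(|layer| {
            let t0 = Instant::now();
            step(layer);
            t0.elapsed().as_micros()
        })
        .collect()
}

fn main() {
    // Dummy workload whose cost varies by layer, the way MoE routing does.
    let times = profile_layers(4, |layer| {
        let mut acc = 0u64;
        for i in 0..(50_000 * (layer as u64 + 1)) {
            acc = acc.wrapping_add(i);
        }
        std::hint::black_box(acc); // keep the loop from being optimized away
    });
    println!("per-layer times (us): {times:?}");
}
```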

Then I measured attention kernel throughput in isolation, varying the KV-cache size on a single core:

10 positions     (0.0 MB KV, in L1)    →  92 GFLOP/s
100 positions    (0.4 MB KV, in L2)    →  84 GFLOP/s
1000 positions   (3.9 MB KV, in L3)    →  91 GFLOP/s
10000 positions  (39 MB KV, in DRAM)   →  56 GFLOP/s
17000 positions  (66 MB KV, in DRAM)   →  45 GFLOP/s

Same kernel. Same core. Same instruction stream. Same clock rate. 50% throughput variation depending on where the KV cache happens to live in the memory hierarchy. And the relationship isn't even monotonic: L3 is faster than L2, because the hardware prefetcher excels at sequential L3 streaming but L2 capacity forces evictions.
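The cache-level labels in that table follow directly from the KV footprint per position. A sketch using the GQA configuration from the model description above, assuming f32 KV storage (the function name is mine):

```rust
/// Per-position KV-cache bytes for one layer: K and V, each
/// [n_kv_heads, head_dim], stored in f32 (4 bytes per value).
fn kv_bytes_per_pos(n_kv_heads: u64, head_dim: u64) -> u64 {
    2 * n_kv_heads * head_dim * 4
}

fn main() {
    // 4 KV heads x 128 head dim -> 4096 bytes per position per layer.
    let per_pos = kv_bytes_per_pos(4, 128);
    for positions in [100u64, 1_000, 10_000, 17_000] {
        let mb = (positions * per_pos) as f64 / (1024.0 * 1024.0);
        println!("{positions:>6} positions -> {mb:.1} MB of KV per layer");
    }
}
```

Run it and the sizes line up with the table: ~0.4 MB at 100 positions (L2-resident), ~3.9 MB at 1,000 (L3), ~39 MB and ~66 MB at 10,000 and 17,000 (DRAM).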


Then you add threads. And the physics changes again.

I measured scaling on the Ryzen 9 (12 physical cores, 24 logical with SMT, split across two Zen 4 chiplets) with a 1,000-token prompt:

Threads   Prefill tok/s   Decode tok/s
1           9.7  (1.0x)     5.4  (1.0x)
2          19.5  (2.0x)    10.3  (1.9x)
4          38.2  (3.9x)    16.8  (3.1x)
6          54.3  (5.6x)    17.8  (3.3x)
8          68.6  (7.1x)    21.6  (4.0x)
10         80.3  (8.3x)    21.6  (4.0x)
12         89.7  (9.2x)    21.6  (4.0x)  ← 12 physical cores
14         63.8  (6.6x)    20.8  (3.9x)  ← SMT CLIFF
16         71.4  (7.4x)    20.5  (3.8x)
20         76.3  (7.9x)    20.1  (3.7x)
24         79.9  (8.2x)    18.9  (3.5x)

Thread scaling: prefill vs decode on Ryzen 9

Prefill scales nearly linearly up to 12 physical cores (9.2x for 12x). Then at 14 threads, SMT kicks in and prefill collapses by 29%. The L2 cache is 1 MB per core, shared between SMT siblings. Each attention layer streams 3.8 MB of KV data, nearly 4x what the L2 can hold. Adding a second logical thread per core halves the effective L2 to 512 KB/thread, and the hardware prefetcher collapses. Prefill never recovers its 12-thread peak, even at 24T.

Decode saturates at 8 threads (21.6 tok/s) and flatlines. Beyond that, each added logical thread slightly degrades throughput: 20.8 at 14T, 18.9 at 24T. SMT can't help here, because decode is DRAM-bandwidth-bound, not latency-bound: a second thread per core can hide stalls, but it can't create bandwidth.
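Given the SMT cliff, a defensive default is to cap the worker pool at physical cores. A sketch: `available_parallelism` reports logical CPUs, so when SMT is enabled I halve it (the 2x logical-to-physical ratio is an assumption baked into this heuristic, not a real topology probe):

```rust
use std::num::NonZeroUsize;
use std::thread;

/// Worker count for prefill: assume logical CPUs = 2x physical when
/// SMT is enabled (heuristic assumption, not a topology query).
fn prefill_workers(smt_enabled: bool) -> usize {
    let logical = thread::available_parallelism()
        .map(NonZeroUsize::get)
        .unwrap_or(1);
    if smt_enabled { (logical / 2).max(1) } else { logical }
}

fn main() {
    println!("prefill workers: {}", prefill_workers(true));
}
```

On the Ryzen 9 above this would pick 12 of the 24 logical CPUs, which is exactly where prefill peaks.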


Then you change the quantization format. And the two regimes react in opposite directions.

I compared Q4 (4-bit, with dequantization) and int8 (8-bit, native integer arithmetic) on the Ryzen with 12 threads:

                  Q4                    Int8
Prompt tokens   Prefill  Decode     Prefill  Decode    (all tok/s)
81               76.1    21.7        83.2    14.7
1,096            89.6    21.6       107.7    14.5
8,134            69.5    15.2        80.4    11.4

Q4 vs Int8: the prefill/decode trade-off

Int8 prefill is 20% faster: it skips the Q4→f32 dequantization step and directly uses the Ryzen's VNNI instructions. But Q4 decode is 49% faster: Q4 weights generate half the DRAM traffic of int8 weights.

The trade-off is clear: int8 for compute, Q4 for bandwidth. And since decode dominates total time in conversational inference, Q4 wins overall.
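The bandwidth side of that trade-off is first-order arithmetic: traffic scales with bits per weight (ignoring scale/packing overhead, which shifts the ratio slightly). A sketch anchored to the ~500 MB/token Q4 figure measured earlier:

```rust
/// DRAM traffic ratio between two weight formats, by bits per weight
/// (scale and packing overhead ignored -- a first-order sketch).
fn traffic_ratio(bits_a: f64, bits_b: f64) -> f64 {
    bits_a / bits_b
}

fn main() {
    // Baseline: the ~500 MB/token Q4 figure measured above.
    let q4_mb = 500.0;
    let int8_mb = q4_mb * traffic_ratio(8.0, 4.0);
    println!("decode traffic: Q4 ~{q4_mb:.0} MB/token, int8 ~{int8_mb:.0} MB/token");
}
```

Double the bytes streamed per token, and the bandwidth-bound decode ceiling halves, which is what the 21.6 vs 14.5 tok/s decode columns show.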


But the real pipeline is not linear.

Everything above assumes a clean sequence: one prefill, then decode. In practice, with tool-augmented models, that never happens.

Consider a concrete scenario. The user asks "What's the error in this file?" (20 tokens), plus a system prompt and conversation history of maybe 200 tokens. Total prefill: 220 tokens. The matrices are small. Thread dispatch and synchronization overhead dominate the actual arithmetic. This is Dispatch-Bound Prefill, the third regime, the one that doesn't appear in the benchmarks above because I was measuring with 100+ token prompts. But the hint is already there: at 81 tokens, prefill runs at 76 tok/s on the Ryzen; at 1,096 tokens it peaks at 90, i.e. 18% faster per token despite 13x more total work. On the M3 Ultra, the same comparison goes from 171 to 348 tok/s. The matrices at 81 tokens are already borderline. Below that, dispatch overhead dominates.

The model decodes a tool call: 30 tokens. Each token streams the full weight tensor through DRAM. Memory-Bound Decode.

The tool returns 5,000 tokens of file content. Now the engine runs a second prefill, but this time the matrices are large. [5000, 2048] × [2048, 4096]. All cores saturate. The hardware prefetcher streams L2 at full bandwidth. Streaming Prefill, a completely different physical regime from the first prefill, triggered within the same inference pass.

Decode resumes. But now the KV cache holds 5,250 positions instead of 220. The attention kernel that was running at 92 GFLOP/s with the small cache is now closer to 56 GFLOP/s, because 5,250 positions push the KV data from L3 into DRAM.

Same binary. Same weights. Same prompt. Same conversation turn. Four distinct phases spanning all three regimes. And the throughput number changes at each phase transition.

This is why the regime model matters beyond benchmarking. A single tool-augmented inference pass doesn't live in one regime: it traverses all of them. The "tokens per second" you measure depends entirely on which phase you happen to be timing.


The real lesson isn't about any single optimization. It's that LLM inference on CPU is not one workload. It's at least three distinct physical regimes:

  1. Streaming Prefill (long prompt, big matrices): compute-heavy but sensitive to L2 bandwidth. The HW prefetcher matters. SMT hurts. The CPU is genuinely busy.

  2. Dispatch-Bound Prefill (short prompt, warm KV cache): overhead-dominated. The work units are so small that thread synchronization costs more than the computation itself, so adding cores yields steeply diminishing returns; the same dispatch overhead is part of why decode gains only 4x from 1 to 8 threads.

  3. Memory-Bound Decode (one token at a time): DRAM-bandwidth-bound weight streaming. SMT doesn't help. ~535 MB of DRAM traffic per generated token. The CPU is mostly waiting.

Same engine. Same binary. Three different sets of physical laws.

This is not an academic distinction. It has direct engineering consequences: thread pool sizing, memory page strategy, cache partitioning, quantization format. Every decision depends on which regime dominates your workload. A chatbot answering short questions lives in Dispatch-Bound Prefill and Memory-Bound Decode. A RAG pipeline processing long documents lives in Streaming Prefill. The optimal configuration for each is different, sometimes opposite.
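One way to make the regime distinction concrete in code is to classify each phase before configuring it. A sketch (the enum, the function, and the 100-token boundary are all illustrative, not the engine's actual API; the data above only pins the boundary somewhere below ~81 tokens):

```rust
/// The three regimes described above.
#[derive(Debug, PartialEq)]
enum Regime {
    DispatchBoundPrefill,
    StreamingPrefill,
    MemoryBoundDecode,
}

/// Classify one phase by its sequence length. Decode is S = 1; the
/// 100-token prefill boundary is an illustrative threshold.
fn classify(seq_len: usize) -> Regime {
    match seq_len {
        1 => Regime::MemoryBoundDecode,
        s if s < 100 => Regime::DispatchBoundPrefill,
        _ => Regime::StreamingPrefill,
    }
}

fn main() {
    // The tool-call scenario above: short prefill, decode, long prefill.
    for s in [20, 1, 5000] {
        println!("S = {s:>4} -> {:?}", classify(s));
    }
}
```

Each arm would then select its own thread count, quantization path, and prefetch strategy, rather than one configuration trying to serve all three.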

When someone tells you their inference engine runs at "X tokens per second", ask them: at what sequence length? During which phase? On how many threads? With what cache state? Cold or warm?

Key takeaways

Inference throughput is not a number. It's a function of regime.

And if you optimize without knowing which regime you're in, you're optimizing blind.


The source code for the experiments presented in this article is available on GitHub.