Qwen3 vs GPT-OSS: CPU Benchmark Without GPU
I compared Qwen3 30B-A3B and GPT-OSS 20B on an 8-core AMD EPYC 4344P with no GPU, and the results are surprising.
The raw numbers
| Model | Tokens/second |
|---|---|
| GPT-OSS 20B | 7.50 |
| Qwen3 30B-A3B (Q8_0) | 15.12 |
| Qwen3 30B-A3B (Q4_K_M) | 24.02 |
| Qwen3 30B-A3B (think mode) | 22.99 |
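For context, here is a minimal sketch of how such throughput figures can be measured with llama-cpp-python. The article does not state which runtime produced the numbers above, so the model path and settings below are hypothetical, and the timing includes prompt processing, so this is only a rough estimate:

```python
# Rough tokens/second measurement with llama-cpp-python (pip install llama-cpp-python).
# The GGUF path is a placeholder; it is not necessarily the file used for the table above.
import time
from llama_cpp import Llama

llm = Llama(
    model_path="./Qwen3-30B-A3B-Q4_K_M.gguf",  # hypothetical local path
    n_ctx=4096,
    n_threads=8,  # one thread per physical core on the EPYC 4344P
    verbose=False,
)

prompt = "Explain the Mixture of Experts architecture in three sentences."
start = time.perf_counter()
out = llm(prompt, max_tokens=256)
elapsed = time.perf_counter() - start

generated = out["usage"]["completion_tokens"]
print(f"{generated} tokens in {elapsed:.1f} s -> {generated / elapsed:.2f} tok/s")
```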
Both models use a MoE (Mixture of Experts) architecture: a panel of smaller, more specialized experts in which only a few experts are activated for each token, reducing the amount of computation per forward pass.
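To make that concrete, here is a tiny routing sketch in plain Python/NumPy. It is not either model's actual implementation, and the dimensions are toy values; it only shows the mechanism of picking the top-k experts for each token and computing nothing for the rest:

```python
# Minimal top-k MoE routing illustration; all sizes are toy values, not the real configs.
import numpy as np

hidden_dim, ffn_dim, n_experts, top_k = 64, 128, 8, 2
rng = np.random.default_rng(0)

router_w = rng.standard_normal((hidden_dim, n_experts))               # routing matrix
experts_up = rng.standard_normal((n_experts, hidden_dim, ffn_dim))    # per-expert up projection
experts_down = rng.standard_normal((n_experts, ffn_dim, hidden_dim))  # per-expert down projection

def moe_layer(x):
    """Route one token, then run only the top_k selected experts."""
    logits = x @ router_w
    selected = np.argsort(logits)[-top_k:]                                 # best-scoring experts
    weights = np.exp(logits[selected]) / np.exp(logits[selected]).sum()    # softmax over them
    out = np.zeros_like(x)
    for w, e in zip(weights, selected):
        out += w * (np.maximum(x @ experts_up[e], 0.0) @ experts_down[e])
    return out

token = rng.standard_normal(hidden_dim)
print(moe_layer(token).shape)  # (64,) -- only 2 of the 8 experts were actually computed
```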
Why is Qwen3 so fast with 50% more parameters?
This is the central question. Qwen3 has 30 billion parameters versus 20 billion for GPT-OSS, yet it is two to three times faster on CPU. An image by Sebastian Raschka posted on X provides some clues.
Qwen3 30B architecture
- 128 experts / 8 active per token
- 48 layers
- FFN (intermediate size per expert): 768
- Embeddings: 2048
- 32 attention heads
- Context: 262k tokens
- Approximately 3B active parameters
GPT-OSS 20B architecture
- 32 experts / 4 active per token
- 24 layers
- FFN (intermediate size per expert): 2880
- Embeddings: 2880
- 64 attention heads
- Context: 131k tokens
- Approximately 3.6B active parameters (both sets of figures can be read straight from the models' configurations, as sketched below)
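A sketch of reading these architecture figures with Hugging Face transformers is below. The attribute names follow the Qwen3-MoE config and may differ for GPT-OSS, so the fallbacks are assumptions rather than a guaranteed mapping:

```python
# Sketch: pull the architecture figures above straight from each model's config.
# Attribute names vary between architectures (e.g. num_experts vs num_local_experts),
# so the getattr fallbacks below are assumptions, not a guaranteed mapping.
from transformers import AutoConfig

for repo in ["Qwen/Qwen3-30B-A3B", "openai/gpt-oss-20b"]:
    cfg = AutoConfig.from_pretrained(repo)
    experts = getattr(cfg, "num_experts", getattr(cfg, "num_local_experts", None))
    active = getattr(cfg, "num_experts_per_tok", getattr(cfg, "experts_per_token", None))
    expert_ffn = getattr(cfg, "moe_intermediate_size", getattr(cfg, "intermediate_size", None))
    print(
        f"{repo}: layers={cfg.num_hidden_layers}, hidden={cfg.hidden_size}, "
        f"heads={cfg.num_attention_heads}, experts={experts}, "
        f"active_per_token={active}, expert_ffn={expert_ffn}"
    )
```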
Expert size makes all the difference
This is the key to the mystery. With much lighter intermediate (FFN) layers than GPT-OSS, each expert's computation in Qwen3 is significantly cheaper. Even with 8 active experts (versus 4 for GPT-OSS), the total FFN cost per token remains lower, as the quick estimate below shows.
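A back-of-envelope estimate using the figures above makes the point. Assuming each expert is a standard gated FFN with three weight matrices of size hidden × FFN (a simplification, not either model's exact implementation), the FFN weights actually touched per token are:

```python
# Rough estimate of active FFN parameters per token, from the figures listed above.
# Assumes 3 weight matrices (gate/up/down) of size hidden x ffn per expert -- a
# simplification, not the exact implementation of either model.
def active_ffn_params(layers, hidden, ffn, active_experts, matrices=3):
    return layers * active_experts * matrices * hidden * ffn

qwen3 = active_ffn_params(layers=48, hidden=2048, ffn=768, active_experts=8)
gpt_oss = active_ffn_params(layers=24, hidden=2880, ffn=2880, active_experts=4)

print(f"Qwen3 30B-A3B : ~{qwen3 / 1e9:.2f}B active FFN parameters per token")
print(f"GPT-OSS 20B   : ~{gpt_oss / 1e9:.2f}B active FFN parameters per token")
# Despite activating twice as many experts, Qwen3 touches fewer FFN weights per
# token because each of its experts is so much smaller.
```

This estimate alone does not explain the whole measured gap; the other factors discussed below also weigh in.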
Attention: an often underestimated factor
Adding to this, Qwen3's 32 attention heads, versus 64 for GPT-OSS, considerably reduce the QKV (Query, Key, Value) projection work, a major consumer of resources. This is a significant advantage during inference, particularly on CPU.
Deeper but lighter per layer
With 48 layers versus 24, Qwen3 is deeper, but each layer is lighter. On CPU, this matters enormously: the processor can handle each layer quickly, and the additional depth allows for better reasoning quality without blowing up computation time.
The MXFP4 format: an unknown
One unknown remains: the MXFP4 (4-bit) format used by GPT-OSS could also penalize its performance on CPUs that lack dedicated hardware to leverage this format efficiently. Without native hardware support, the processor must perform additional conversions, which slows down inference. This is a point worth watching.
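To give a sense of what those extra conversions involve, here is a rough sketch of decoding one MXFP4 block, following the OCP microscaling description (blocks of 32 FP4 E2M1 values sharing one power-of-two scale). It is an illustration of the principle, not the actual kernel of llama.cpp or any other runtime:

```python
# Toy MXFP4 block decode: 32 four-bit E2M1 codes plus one shared power-of-two scale.
# A CPU without native support has to do work like this (in vectorized form) for
# every block before it can multiply, which is the conversion overhead in question.
import numpy as np

# The 16 representable E2M1 values: sign x {0, 0.5, 1, 1.5, 2, 3, 4, 6}.
E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
                 -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0])

def decode_mxfp4_block(codes, scale_exp):
    """Decode one block: 32 four-bit codes and a shared exponent -> 32 floats."""
    return E2M1[codes] * (2.0 ** scale_exp)

rng = np.random.default_rng(0)
codes = rng.integers(0, 16, size=32)  # stand-in for unpacked 4-bit weight codes
print(decode_mxfp4_block(codes, scale_exp=-3)[:8])
```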
Conclusion
This benchmark shows that a model's total parameter count is not the determining factor for CPU inference performance. What truly matters is:
- Active expert size: smaller experts = faster computations
- Number of attention heads: fewer heads = fewer QKV operations
- Depth/width balance: more lightweight layers rather than fewer heavy layers
- Quantization format compatibility with the target hardware
Shall we re-run these tests on GPU?