Qwen3 vs GPT-OSS: CPU Benchmark Without GPU

I compared Qwen3 30B-A3B and GPT-OSS 20B on an AMD EPYC 4344P 8-Core without GPU, and the results are surprising.

The raw numbers

| Model | Tokens/second |
| --- | --- |
| GPT-OSS 20B | 7.50 |
| Qwen3 30B-A3B (Q8_0) | 15.12 |
| Qwen3 30B-A3B (Q4_K_M) | 24.02 |
| Qwen3 30B (think mode) | 22.99 |

Both models use a MoE (Mixture of Experts) architecture: a panel of smaller, more specialized sub-networks ("experts"), of which only a few are activated for each token, which keeps the computation per token low.
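To make the routing idea concrete, here is a minimal sketch (plain NumPy, not taken from either model's codebase) of top-k expert selection: a router scores every expert for the current token, only the k best actually run, and their outputs are blended with the router weights. All sizes below are illustrative placeholders.

```python
import numpy as np

def moe_layer(x, router_w, experts, k=2):
    """Toy Mixture-of-Experts layer: route one token to its top-k experts.

    x        : (d_model,) hidden state for one token
    router_w : (n_experts, d_model) router/gating weights
    experts  : list of callables, each mapping (d_model,) -> (d_model,)
    k        : number of experts activated per token
    """
    logits = router_w @ x                     # score every expert
    top_k = np.argsort(logits)[-k:]           # keep only the k best
    weights = np.exp(logits[top_k])
    weights /= weights.sum()                  # softmax over the selected experts
    # Only k experts actually run; the others cost nothing for this token.
    return sum(w * experts[i](x) for w, i in zip(weights, top_k))

# Illustrative setup (sizes are placeholders, not the real model configs).
rng = np.random.default_rng(0)
d_model, n_experts = 64, 8
experts = [lambda h, W=rng.standard_normal((d_model, d_model)) * 0.01: W @ h
           for _ in range(n_experts)]
router_w = rng.standard_normal((n_experts, d_model)) * 0.01
out = moe_layer(rng.standard_normal(d_model), router_w, experts, k=2)
print(out.shape)  # (64,)
```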

Why is Qwen3 so fast with 50% more parameters?

This is the central question. Qwen3 has 30 billion parameters versus 20 billion for GPT-OSS, yet it is two to three times faster on CPU. An image by Sebastian Raschka posted on X provides some clues.

Qwen3 30B architecture

GPT-OSS 20B architecture

Expert size makes all the difference

This is the key to the mystery. With much lighter intermediate layers (FFN) than GPT-OSS, each expert's feed-forward pass in Qwen3 is significantly cheaper. Even with 8 active experts (versus 4 for GPT-OSS), the total cost per token remains lower.
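A back-of-envelope estimate makes this concrete. The configuration values below (hidden size, per-expert FFN width, active experts, layer count) are approximate figures as I recall them from the publicly released configs; treat them as assumptions and check the official model cards before relying on them.

```python
def active_ffn_params(d_model, d_expert, n_active, n_layers):
    """Approximate FFN parameters touched per token across all layers.

    Assumes a gated (SwiGLU-style) FFN with three weight matrices per expert:
    gate (d_model x d_expert), up (d_model x d_expert), down (d_expert x d_model).
    """
    per_expert = 3 * d_model * d_expert
    return n_layers * n_active * per_expert

# Approximate configs (assumptions -- verify against the official model cards).
qwen3 = active_ffn_params(d_model=2048, d_expert=768, n_active=8, n_layers=48)
gpt_oss = active_ffn_params(d_model=2880, d_expert=2880, n_active=4, n_layers=24)

print(f"Qwen3 30B-A3B  ~{qwen3 / 1e9:.2f}B active FFN params per token")
print(f"GPT-OSS 20B    ~{gpt_oss / 1e9:.2f}B active FFN params per token")
# Despite activating twice as many experts and being twice as deep,
# Qwen3 touches fewer FFN weights per token because each expert is much smaller.
```

Under these assumptions, Qwen3 touches roughly a quarter fewer FFN weights per generated token than GPT-OSS, despite its larger total parameter count.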

Attention: an often underestimated factor

To drive the point home, Qwen3's 32 attention heads versus 64 for GPT-OSS considerably reduce the QKV (Query, Key, Value) projection work, one of the more expensive operations in a transformer. This is a significant advantage during inference, particularly on CPU.
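For intuition, the size of the Q, K and V projections for one token scales with the number of heads times the per-head dimension, and grouped-query attention lets K and V use fewer heads than Q. The helper below is a generic sketch with placeholder numbers: it assumes the same per-head dimension for both models, which is not stated in this article, so verify the real head dimensions against the model cards.

```python
def qkv_projection_params(d_model, n_q_heads, n_kv_heads, head_dim):
    """Parameters (and, up to a factor of 2, FLOPs per token) of the Q, K, V projections.

    Q projects to n_q_heads * head_dim; K and V each project to n_kv_heads * head_dim
    (grouped-query attention: fewer K/V heads than Q heads).
    """
    q = d_model * n_q_heads * head_dim
    kv = 2 * d_model * n_kv_heads * head_dim
    return q + kv

# Placeholder numbers for illustration only -- not the real model configs.
print(qkv_projection_params(d_model=2048, n_q_heads=32, n_kv_heads=4, head_dim=128))
print(qkv_projection_params(d_model=2048, n_q_heads=64, n_kv_heads=8, head_dim=128))
```

At a fixed per-head dimension, halving the number of heads halves the projection cost.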

Deeper but lighter per layer

With 48 layers versus 24, Qwen3 is deeper, but each layer is lighter. On CPU, this matters enormously: the processor can handle each layer quickly, and the additional depth allows for better reasoning quality without blowing up computation time.

The MXFP4 format: an unknown factor

One unknown remains: the MXFP4 (4-bit) format used by GPT-OSS could also penalize its performance on CPUs that lack dedicated hardware to leverage this format efficiently. Without native hardware support, the processor must perform additional conversions, which slows down inference. This is a point worth watching.
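To give a sense of the conversion work involved, here is a simplified sketch of how a microscaling FP4 block is decoded: 32 four-bit E2M1 values share a single power-of-two scale. This follows the OCP Microscaling format in spirit; the exact packing and layout in the GPT-OSS checkpoints may differ, so read it as an illustration of the per-block dequantization a CPU without native MXFP4 support must perform, not as the actual decoder.

```python
import numpy as np

# The 8 magnitudes representable by FP4 E2M1 (1 sign, 2 exponent, 1 mantissa bit).
E2M1_MAGNITUDES = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)

def dequantize_mxfp4_block(packed_nibbles, shared_exponent):
    """Decode one MX block: 32 FP4 (E2M1) values sharing a power-of-two scale.

    packed_nibbles  : uint8 array of 16 bytes, two 4-bit codes per byte
    shared_exponent : the block's E8M0 scale byte (scale = 2**(e - 127))
    """
    # Unpack two 4-bit codes per byte (nibble order here is illustrative).
    low = packed_nibbles & 0x0F
    high = packed_nibbles >> 4
    codes = np.stack([low, high], axis=1).reshape(-1)        # 32 codes
    signs = np.where(codes & 0x8, -1.0, 1.0).astype(np.float32)
    values = signs * E2M1_MAGNITUDES[codes & 0x7]
    scale = np.float32(2.0) ** (int(shared_exponent) - 127)  # E8M0 shared scale
    return values * scale

# One illustrative block: 16 random bytes plus one scale byte.
rng = np.random.default_rng(0)
block = rng.integers(0, 256, size=16, dtype=np.uint8)
print(dequantize_mxfp4_block(block, shared_exponent=127))  # scale = 2**0
```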

Conclusion

This benchmark shows that a model's total parameter count is not the determining factor for CPU inference performance. What truly matters is:

- the size of the experts actually activated per token (the FFN width),
- the cost of the attention computation (number of heads),
- the per-layer cost rather than depth alone,
- and how well the weight format (MXFP4, Q8_0, Q4_K_M, ...) is supported by the target hardware.

Shall we re-run these tests on GPU?