Qwen3 vs GPT-OSS: CPU Benchmark Without GPU
I compared Qwen3 30B-A3B and GPT-OSS 20B on an 8-core AMD EPYC 4344P with no GPU, and the results are surprising.
The raw numbers
| Model | Tokens/second |
|---|---|
| GPT-OSS 20B | 7.50 |
| Qwen3 30B-A3B (Q8_0) | 15.12 |
| Qwen3 30B-A3B (Q4_K_M) | 24.02 |
| Qwen3 30B-A3B (think mode) | 22.99 |
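For context, here is a minimal sketch of how such throughput figures can be measured with llama-cpp-python. The article does not state which runtime produced the numbers above, so the model path and settings below are hypothetical, and the timing includes prompt processing, so this is only a rough estimate:

```python
# Rough tokens/second measurement with llama-cpp-python (pip install llama-cpp-python).
# The GGUF path is a placeholder; it is not necessarily the file used for the table above.
import time
from llama_cpp import Llama

llm = Llama(
    model_path="./Qwen3-30B-A3B-Q4_K_M.gguf",  # hypothetical local path
    n_ctx=4096,
    n_threads=8,  # one thread per physical core on the EPYC 4344P
    verbose=False,
)

prompt = "Explain the Mixture of Experts architecture in three sentences."
start = time.perf_counter()
out = llm(prompt, max_tokens=256)
elapsed = time.perf_counter() - start

generated = out["usage"]["completion_tokens"]
print(f"{generated} tokens in {elapsed:.1f} s -> {generated / elapsed:.2f} tok/s")
```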
Both models use a MoE (Mixture of Experts) architecture: a panel of smaller, more specialized experts in which only a few experts are activated for each token, reducing the amount of computation per forward pass.
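To make that concrete, here is a tiny routing sketch in plain Python/NumPy. It is not either model's actual implementation, and the dimensions are toy values; it only shows the mechanism of picking the top-k experts for each token and computing nothing for the rest:

```python
# Minimal top-k MoE routing illustration; all sizes are toy values, not the real configs.
import numpy as np

hidden_dim, ffn_dim, n_experts, top_k = 64, 128, 8, 2
rng = np.random.default_rng(0)

router_w = rng.standard_normal((hidden_dim, n_experts))               # routing matrix
experts_up = rng.standard_normal((n_experts, hidden_dim, ffn_dim))    # per-expert up projection
experts_down = rng.standard_normal((n_experts, ffn_dim, hidden_dim))  # per-expert down projection

def moe_layer(x):
    """Route one token, then run only the top_k selected experts."""
    logits = x @ router_w
    selected = np.argsort(logits)[-top_k:]                                 # best-scoring experts
    weights = np.exp(logits[selected]) / np.exp(logits[selected]).sum()    # softmax over them
    out = np.zeros_like(x)
    for w, e in zip(weights, selected):
        out += w * (np.maximum(x @ experts_up[e], 0.0) @ experts_down[e])
    return out

token = rng.standard_normal(hidden_dim)
print(moe_layer(token).shape)  # (64,) -- only 2 of the 8 experts were actually computed
```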
Why is Qwen3 so fast with 50% more parameters?
This is the central question. Qwen3 has 30 billion parameters versus 20 billion for GPT-OSS, yet it is two to three times faster on CPU. An image by Sebastian Raschka posted on X provides some clues.
Qwen3 30B architecture
- 128 experts / 8 active per token
- 48 layers
- FFN (intermediate size per expert): 768
- Embeddings: 2048
- 32 attention heads
- Context: 262k tokens
- Approximately 3B active parameters
GPT-OSS 20B architecture
- 32 experts / 4 active per token
- 24 layers
- FFN (intermediate size per expert): 2880
- Embeddings: 2880
- 64 attention heads
- Context: 131k tokens
- Approximately 3.6B active parameters (both sets of figures can be read straight from the models' configurations, as sketched below)
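A sketch of reading these architecture figures with Hugging Face transformers is below. The attribute names follow the Qwen3-MoE config and may differ for GPT-OSS, so the fallbacks are assumptions rather than a guaranteed mapping:

```python
# Sketch: pull the architecture figures above straight from each model's config.
# Attribute names vary between architectures (e.g. num_experts vs num_local_experts),
# so the getattr fallbacks below are assumptions, not a guaranteed mapping.
from transformers import AutoConfig

for repo in ["Qwen/Qwen3-30B-A3B", "openai/gpt-oss-20b"]:
    cfg = AutoConfig.from_pretrained(repo)
    experts = getattr(cfg, "num_experts", getattr(cfg, "num_local_experts", None))
    active = getattr(cfg, "num_experts_per_tok", getattr(cfg, "experts_per_token", None))
    expert_ffn = getattr(cfg, "moe_intermediate_size", getattr(cfg, "intermediate_size", None))
    print(
        f"{repo}: layers={cfg.num_hidden_layers}, hidden={cfg.hidden_size}, "
        f"heads={cfg.num_attention_heads}, experts={experts}, "
        f"active_per_token={active}, expert_ffn={expert_ffn}"
    )
```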
Expert size makes all the difference
This is the key to the mystery. With much lighter intermediate (FFN) layers than GPT-OSS, each expert's computation in Qwen3 is significantly cheaper. Even with 8 active experts (versus 4 for GPT-OSS), the total FFN cost per token remains lower, as the quick estimate below shows.
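A back-of-envelope estimate using the figures above makes the point. Assuming each expert is a standard gated FFN with three weight matrices of size hidden × FFN (a simplification, not either model's exact implementation), the FFN weights actually touched per token are:

```python
# Rough estimate of active FFN parameters per token, from the figures listed above.
# Assumes 3 weight matrices (gate/up/down) of size hidden x ffn per expert -- a
# simplification, not the exact implementation of either model.
def active_ffn_params(layers, hidden, ffn, active_experts, matrices=3):
    return layers * active_experts * matrices * hidden * ffn

qwen3 = active_ffn_params(layers=48, hidden=2048, ffn=768, active_experts=8)
gpt_oss = active_ffn_params(layers=24, hidden=2880, ffn=2880, active_experts=4)

print(f"Qwen3 30B-A3B : ~{qwen3 / 1e9:.2f}B active FFN parameters per token")
print(f"GPT-OSS 20B   : ~{gpt_oss / 1e9:.2f}B active FFN parameters per token")
# Despite activating twice as many experts, Qwen3 touches fewer FFN weights per
# token because each of its experts is so much smaller.
```

This estimate alone does not explain the whole measured gap; the other factors discussed below also weigh in.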
Attention: an often underestimated factor
Adding to this, Qwen3's 32 attention heads, versus 64 for GPT-OSS, considerably reduce the QKV (Query, Key, Value) projection work, a major consumer of resources. This is a significant advantage during inference, particularly on CPU.
Deeper but lighter per layer
With 48 layers versus 24, Qwen3 is deeper, but each layer is lighter. On CPU, this matters enormously: the processor can handle each layer quickly, and the additional depth allows for better reasoning quality without blowing up computation time.
The MXFP4 format: an unknown
One unknown remains: the MXFP4 (4-bit) format used by GPT-OSS could also penalize its performance on CPUs that lack dedicated hardware to leverage this format efficiently. Without native hardware support, the processor must perform additional conversions, which slows down inference. This is a point worth watching.
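To give a sense of what those extra conversions involve, here is a rough sketch of decoding one MXFP4 block, following the OCP microscaling description (blocks of 32 FP4 E2M1 values sharing one power-of-two scale). It is an illustration of the principle, not the actual kernel of llama.cpp or any other runtime:

```python
# Toy MXFP4 block decode: 32 four-bit E2M1 codes plus one shared power-of-two scale.
# A CPU without native support has to do work like this (in vectorized form) for
# every block before it can multiply, which is the conversion overhead in question.
import numpy as np

# The 16 representable E2M1 values: sign x {0, 0.5, 1, 1.5, 2, 3, 4, 6}.
E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
                 -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0])

def decode_mxfp4_block(codes, scale_exp):
    """Decode one block: 32 four-bit codes and a shared exponent -> 32 floats."""
    return E2M1[codes] * (2.0 ** scale_exp)

rng = np.random.default_rng(0)
codes = rng.integers(0, 16, size=32)  # stand-in for unpacked 4-bit weight codes
print(decode_mxfp4_block(codes, scale_exp=-3)[:8])
```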
Conclusion
This benchmark shows that a model's total parameter count is not the determining factor for CPU inference performance. What truly matters is:
- Active expert size: smaller experts = faster computations
- Number of attention heads: fewer heads = fewer QKV operations
- Depth/width balance: more lightweight layers rather than fewer heavy layers
- Quantization format compatibility with the target hardware
Shall we re-run these tests on GPU?