Beating PyTorch with Rust and 180 Lines of Assembly
Can you beat PyTorch on its own turf with Rust and 180 lines of assembly? When inference is M=1, cache decides everything.
The Initial Wall
The story starts with a GPT-2 implementation in Rust on Apple Silicon, and the first result is brutal: even optimized, the Rust version remains slower than PyTorch + Accelerate. That is expected: industrial BLAS libraries are excellent.
But frameworks like PyTorch are generalists. Token-by-token inference is anything but a general case.
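To make that concrete: during autoregressive decoding, each new token passes through every linear layer as a single row of activations, so the matmul collapses to a matrix-vector product. A minimal sketch, in f32 with illustrative names (the actual engine works in f16):

```rust
/// One linear layer during single-token decoding.
/// x: activation vector for the one new token (length k)
/// w: weights stored row-major as n rows of k columns ([n][k])
/// With M = 1 this is a GEMV: each weight is loaded once, used once.
fn gemv_naive(x: &[f32], w: &[f32], n: usize, k: usize) -> Vec<f32> {
    assert_eq!(x.len(), k);
    assert_eq!(w.len(), n * k);
    (0..n)
        .map(|row| {
            // Dot product of one weight row with the activation vector.
            w[row * k..(row + 1) * k]
                .iter()
                .zip(x)
                .map(|(wi, xi)| wi * xi)
                .sum()
        })
        .collect()
}
```

In a batched forward pass or during prompt prefill, the same weights are reused across many activation rows and a tuned GEMM amortizes the memory traffic. At M=1, every weight is loaded from memory, used exactly once, and discarded.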
Exploring the Memory Hierarchy
After years of work on CPU optimization, SIMD instructions, and the memory hierarchy, the same question keeps coming back: how can LLM inference be improved in very specific cases, especially when running locally?
The approach starts simple: a GPT-2 engine in numpy. Feel where the cost lies. Observe. Measure. Thanks to the work of Karpathy and Raschka, it is a real pleasure.
The Problem Is Not Compute
What clearly emerges:
- The problem is not compute power
- It is cache management
- Token-by-token inference is memory-bound, not compute-bound (a rough estimate below makes this concrete)
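Here is that back-of-the-envelope estimate, assuming roughly 124M parameters for GPT-2 Small and f16 weights, and ignoring the KV cache and activations, which are small in comparison. Each generated token has to stream the entire weight set through the core for about two FLOPs per weight:

```rust
// Back-of-the-envelope arithmetic intensity for one decoded token,
// assuming GPT-2 Small (~124M parameters) with weights stored in f16.
fn main() {
    let params: f64 = 124e6;    // approximate parameter count
    let bytes_per_param = 2.0;  // f16
    let flops_per_param = 2.0;  // one multiply + one add per weight

    let bytes_read = params * bytes_per_param; // ~248 MB streamed per token
    let flops = params * flops_per_param;      // ~248 MFLOPs per token
    let intensity = flops / bytes_read;        // ~1 FLOP per byte

    println!("bytes/token ≈ {:.0} MB", bytes_read / 1e6);
    println!("FLOPs/token ≈ {:.0} M", flops / 1e6);
    println!("arithmetic intensity ≈ {:.1} FLOP/byte", intensity);
}
```

Around 1 FLOP per byte is far below what an Apple Silicon core can compute, so the bottleneck is how fast ~250 MB of weights can be pulled through the cache hierarchy per token, not how fast they can be multiplied.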
The K-outer Pattern in Assembly
Rewriting the kernels in ARM64 NEON f16 assembly around a specialized "K-outer" pattern gives a spectacular result: 156.9 tokens/s on GPT-2 Small.
That is 3x faster than PyTorch on the same machine.
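The 180 lines of assembly are not reproduced here, but one plausible reading of the "K-outer" pattern is the loop order below: K as the outer loop, one activation element broadcast per iteration, and one contiguous row of the weight matrix streamed exactly once. This is a hedged Rust sketch in f32 with illustrative names; the real kernel works on NEON f16 registers and tiles the accumulators:

```rust
/// K-outer GEMV sketch: y[n] += x[k] * w[k][n], with k as the OUTER loop.
/// Weights are stored row-major as [k][n], so each outer iteration streams
/// one contiguous row of w exactly once -- a purely sequential read pattern
/// that the hardware prefetcher handles well.
fn gemv_k_outer(x: &[f32], w: &[f32], k: usize, n: usize, y: &mut [f32]) {
    assert_eq!(x.len(), k);
    assert_eq!(w.len(), k * n);
    assert_eq!(y.len(), n);
    y.fill(0.0);
    for kk in 0..k {
        let xk = x[kk]; // broadcast one activation element
        let w_row = &w[kk * n..(kk + 1) * n];
        for nn in 0..n {
            y[nn] += xk * w_row[nn]; // accumulate into the output vector
        }
    }
}
```

Compared to the dot-product formulation shown earlier, the weight layout flips from [n][k] to [k][n]: the output vector has to stay hot (in registers or L1), but the weight stream becomes perfectly sequential, which is exactly what a memory-bound GEMV wants.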
Benchmarks on Apple Silicon M3:
| Model | PyTorch (tok/s) | Rust V3 (tok/s) | Speedup |
|---|---|---|---|
| GPT-2 Small | 53 | 156.9 | 3.0x |
| GPT-2 Medium | 22 | 59.3 | 2.7x |
| GPT-2 Large | 13 | 29.1 | 2.2x |
| GPT-2 XL | 7.7 | 16.2 | 2.1x |
What Comes Next?
Today, Qwen3-VL-30B-A3B already runs on an inference engine written in Rust and assembly. And there are many avenues left to explore: Metal, Vulkan, AVX-512...
Specializing for GEMV instead of relying on general GEMM, and treating native f16 on NEON as a powerful weapon for edge inference: if there is one takeaway, it is that understanding your hardware changes the problem entirely.