Beating PyTorch with Rust and 180 Lines of Assembly

Can you beat PyTorch on its own turf with Rust and 180 lines of assembly? When inference is M=1, cache decides everything.

The Initial Wall

Working on a GPT-2 implementation in Rust on Apple Silicon, the first result is brutal: even optimized, the Rust version remains slower than PyTorch + Accelerate. Industrial BLAS libraries are excellent; that much is expected.

But frameworks like PyTorch are generalists. Token-by-token inference is anything but a general case.

Exploring the Memory Hierarchy

After years of work on CPU optimization, SIMD instructions and the memory hierarchy, the same question keeps coming back: how can LLM inference be improved in very specific cases, especially when running locally?

The approach starts simple: a GPT-2 engine in numpy. Feel where the cost lies. Observe. Measure. Thanks to the work of Karpathy and Raschka, it is a real pleasure.

The Problem Is Not Compute

What clearly emerges: token-by-token generation reduces every matrix multiplication to a matrix-vector product (M=1). Each weight is loaded once per token and used for only two floating-point operations, so the bottleneck is how fast the weights stream through the cache hierarchy, not arithmetic. The sketch below shows the shape of the problem.
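As a rough illustration (not the article's code; the function name, f32 types and row-major layout are my assumptions), here is what every matmul collapses to at M=1, in plain Rust:

```rust
/// Illustrative only: the matrix-vector product every matmul collapses to
/// when generating one token at a time (M = 1).
/// `w` is row-major with shape [n, k]; `x` has length k; `y` has length n.
fn gemv_naive(w: &[f32], x: &[f32], y: &mut [f32], n: usize, k: usize) {
    for i in 0..n {
        let row = &w[i * k..(i + 1) * k];
        // k weight loads for 2k flops: every weight is touched exactly once
        // per token, so the kernel is bound by memory bandwidth, not compute.
        y[i] = row.iter().zip(x).map(|(&a, &b)| a * b).sum();
    }
}
```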

The K-outer Pattern in Assembly

After rewriting the kernels in ARM64 NEON f16 assembly with a specialized "K-outer" pattern, the result is spectacular: 156.9 tokens/s on GPT-2 Small.

That is 3x faster than PyTorch on the same machine.
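The real kernel is hand-written NEON f16 assembly; as a hedged sketch of what a "K-outer" loop order can look like (the storage layout, names and f32 types here are my assumptions, not the article's code), the outer loop walks K, broadcasts one element of x, and accumulates the whole output vector:

```rust
/// Sketch of a K-outer GEMV loop order. Weights are stored K-major so that
/// `w_kmajor[kk * n + j] == W[j][kk]`: each step of the outer loop streams
/// one contiguous slab of n weights.
fn gemv_k_outer(w_kmajor: &[f32], x: &[f32], y: &mut [f32], n: usize, k: usize) {
    y.fill(0.0);
    for kk in 0..k {
        let xk = x[kk]; // one scalar, broadcast across all SIMD lanes
        let row = &w_kmajor[kk * n..(kk + 1) * n];
        for j in 0..n {
            // Maps onto a fused multiply-add with a broadcast operand
            // (fmla by element on NEON); accumulating along K in the output
            // vector avoids a horizontal reduction per output element.
            y[j] += xk * row[j];
        }
    }
}
```

In practice one would also tile the output so a block of `y` stays in vector registers across the whole K loop, and f16 weights halve the bytes streamed per token; that level of control is where hand-written assembly pays off.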

Benchmarks on Apple Silicon M3:

| Model        | PyTorch (tok/s) | Rust V3 (tok/s) | Speedup |
|--------------|-----------------|-----------------|---------|
| GPT-2 Small  | 53              | 156.9           | 3.0x    |
| GPT-2 Medium | 22              | 59.3            | 2.7x    |
| GPT-2 Large  | 13              | 29.1            | 2.2x    |
| GPT-2 XL     | 7.7             | 16.2            | 2.1x    |

What Comes Next?

Today, a Qwen3-VL-30B-A3B model already runs on an inference engine written in Rust and assembly. And there are many avenues left to explore: Metal, Vulkan, AVX512...

GEMV specialization versus GEMM generalization, native f16 on NEON as a powerful weapon for edge inference: if there is one takeaway, it is that understanding your hardware changes the problem entirely.