Beating PyTorch with Rust and 180 Lines of Assembly

Can you beat PyTorch on its own turf with Rust and 180 lines of assembly? When inference is M=1, cache decides everything.

The Initial Wall

Working on a GPT-2 implementation in Rust on Apple Silicon, the first result is brutal: even optimized, the Rust version remains slower than PyTorch + Accelerate. Industrial BLAS libraries are excellent; that much is expected.

But frameworks like PyTorch are generalists. Token-by-token inference is anything but a general case.

Exploring the Memory Hierarchy

After years of work on CPU optimization, SIMD instructions and the memory hierarchy, the same question keeps coming back: how can LLM inference be improved in very specific cases, especially when running locally?

The approach starts simple: a GPT-2 engine in numpy. Feel where the cost lies. Observe. Measure. Thanks to the work of Karpathy and Raschka, it is a real pleasure.

The Problem Is Not Compute

What clearly emerges: token-by-token generation reduces every matrix multiplication to a matrix-vector product (M=1). Each weight is loaded once per token and used for only two floating-point operations, so the bottleneck is how fast the weights stream through the cache hierarchy, not arithmetic. The sketch below shows the shape of the problem.
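As a rough illustration (not the article's code; the function name, f32 types and row-major layout are my assumptions), here is what every matmul collapses to at M=1, in plain Rust:

```rust
/// Illustrative only: the matrix-vector product every matmul collapses to
/// when generating one token at a time (M = 1).
/// `w` is row-major with shape [n, k]; `x` has length k; `y` has length n.
fn gemv_naive(w: &[f32], x: &[f32], y: &mut [f32], n: usize, k: usize) {
    for i in 0..n {
        let row = &w[i * k..(i + 1) * k];
        // k weight loads for 2k flops: every weight is touched exactly once
        // per token, so the kernel is bound by memory bandwidth, not compute.
        y[i] = row.iter().zip(x).map(|(&a, &b)| a * b).sum();
    }
}
```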

The K-outer Pattern in Assembly

After rewriting the kernels in ARM64 NEON f16 assembly with a specialized "K-outer" pattern, the result is spectacular: 156.9 tokens/s on GPT-2 Small.

That is 3x faster than PyTorch on the same machine.
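The real kernel is hand-written NEON f16 assembly; as a hedged sketch of what a "K-outer" loop order can look like (the storage layout, names and f32 types here are my assumptions, not the article's code), the outer loop walks K, broadcasts one element of x, and accumulates the whole output vector:

```rust
/// Sketch of a K-outer GEMV loop order. Weights are stored K-major so that
/// `w_kmajor[kk * n + j] == W[j][kk]`: each step of the outer loop streams
/// one contiguous slab of n weights.
fn gemv_k_outer(w_kmajor: &[f32], x: &[f32], y: &mut [f32], n: usize, k: usize) {
    y.fill(0.0);
    for kk in 0..k {
        let xk = x[kk]; // one scalar, broadcast across all SIMD lanes
        let row = &w_kmajor[kk * n..(kk + 1) * n];
        for j in 0..n {
            // Maps onto a fused multiply-add with a broadcast operand
            // (fmla by element on NEON); accumulating along K in the output
            // vector avoids a horizontal reduction per output element.
            y[j] += xk * row[j];
        }
    }
}
```

In practice one would also tile the output so a block of `y` stays in vector registers across the whole K loop, and f16 weights halve the bytes streamed per token; that level of control is where hand-written assembly pays off.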

Benchmarks on Apple Silicon M3:

| Model        | PyTorch (tok/s) | Rust V3 (tok/s) | Speedup |
|--------------|-----------------|-----------------|---------|
| GPT-2 Small  | 53              | 156.9           | 3.0x    |
| GPT-2 Medium | 22              | 59.3            | 2.7x    |
| GPT-2 Large  | 13              | 29.1            | 2.2x    |
| GPT-2 XL     | 7.7             | 16.2            | 2.1x    |

What Comes Next?

Today, a Qwen3-VL-30B-A3B model already runs on an inference engine written in Rust and assembly. And there are many avenues left to explore: Metal, Vulkan, AVX512...

GEMV specialization versus GEMM generalization, native f16 on NEON as a powerful weapon for edge inference: if there is one takeaway, it is that understanding your hardware changes the problem entirely.