From gpt2-experiments to qwen3-experiments-rs
A look back at several weeks of code: Rust, assembly, and kernels in Metal and Vulkan.
From GPT-2 to Qwen3
After exploring the fundamentals with GPT-2, the next step is local inference of modern models: Qwen 3, both dense and MoE.
The goal is not just to "make it run", but to understand the real bottlenecks.
The Finding: Bandwidth, Not FLOPs
Local CPU inference is not a matter of FLOPs, but of memory bandwidth.
Benchmarking a Mac Studio M3 Ultra reveals a fascinating fact: it is an incredible machine, but the CPU cluster alone is limited by the hardware. It physically cannot saturate the full theoretical memory bandwidth (the famous 800 GB/s).
Even with an impressive memory controller and SLC (System Level Cache), for pure CPU workloads, other architectures can paradoxically offer more usable "plumbing" to the cores.
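How much bandwidth a CPU core can actually pull is easy to probe with a streaming-read microbenchmark. The sketch below is a minimal single-threaded version (buffer size is an arbitrary choice; a real measurement needs a buffer much larger than the SLC and several threads to approach the machine's ceiling):

```rust
use std::time::Instant;

/// Streaming-read microbenchmark: sums a large buffer and returns
/// (checksum, observed read bandwidth in GB/s).
/// Single-threaded, so it measures one core's "plumbing", not the SoC total.
fn read_bandwidth(n: usize) -> (u64, f64) {
    let buf: Vec<u64> = vec![1u64; n];
    let start = Instant::now();
    let mut sum = 0u64;
    for &x in &buf {
        // The sum is returned, so the compiler cannot elide the loads.
        sum = sum.wrapping_add(x);
    }
    let secs = start.elapsed().as_secs_f64();
    let bytes = (n * std::mem::size_of::<u64>()) as f64;
    (sum, bytes / secs / 1e9)
}

fn main() {
    // 64M u64s = 512 MiB, large enough to spill any L2 on current machines.
    let (sum, gbps) = read_bandwidth(64 * 1024 * 1024);
    println!("checksum = {sum}, read bandwidth ~= {gbps:.1} GB/s");
}
```

Running a few copies of this in parallel and summing the per-thread numbers is the quickest way to see where a given CPU cluster actually tops out.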
The DRAM Wall in Numbers
Measurements on Qwen3 clearly show the DRAM bandwidth wall:
| Precision | Bits/weight | Decode tok/s |
|---|---|---|
| BF16 | 16 | 2.83 |
| INT8 | 8 | 5.35 |
| Q4 | 4 | 9.69 |
Decode throughput is almost exactly inversely proportional to bits per weight: halving the weight size nearly doubles tokens per second. This is the signature of a purely memory-bound regime.
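The memory-bound regime can be captured in a back-of-the-envelope model: each decoded token must stream every active weight from DRAM once, so throughput is just bandwidth divided by the model's byte footprint. A sketch (the parameter count and bandwidth figures below are hypothetical placeholders, not the measured setup):

```rust
/// Memory-bound decode model: tok/s ~= bandwidth / (params * bytes_per_weight).
/// Valid when weight traffic dominates, i.e. KV cache and activations are
/// small relative to the weights streamed per token.
fn predicted_tok_s(params: f64, bits_per_weight: f64, bandwidth_gb_s: f64) -> f64 {
    let bytes_per_token = params * bits_per_weight / 8.0;
    bandwidth_gb_s * 1e9 / bytes_per_token
}

fn main() {
    // Hypothetical: a 30B-parameter dense model on 100 GB/s of usable CPU bandwidth.
    for bits in [16.0, 8.0, 4.0] {
        let tok_s = predicted_tok_s(30e9, bits, 100.0);
        println!("{bits:>4} bits/weight -> {tok_s:.2} tok/s");
    }
}
```

The model predicts exactly the scaling seen in the table: dropping from 16 to 8 to 4 bits per weight doubles throughput each step, and any shortfall from perfect doubling points at fixed costs (dequantization compute, attention, cache misses) that quantization does not shrink.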
From Experimentation to Product
There is enough material for several articles, but the focus is shifting: moving from pure experimentation to building a product usable by everyone.
This is the only way to make these discoveries concrete.
Experiments continue in the background (memory reorganization, new avenues), but consolidation comes first.