CPU Inference #0: understand before optimizing

I wrote herbert-rs, a local LLM inference engine in Rust. Not a wrapper, not a binding: every matrix multiplication, every attention kernel, every routine is implemented and measured down to the cycle, in x86-64 and ARM64 assembly and in Metal and Vulkan compute shaders, with hardware performance counters (PMCs) validating every decision.

This project rests on a few simple principles:

- Modern inference systems are governed by a large number of variables.
- Every optimization must be preceded and followed by a rigorous measurement campaign; no conclusion is drawn without one.
- Regressions and negative results must be preserved and explained.

The goal is not merely to optimize an engine, but to empirically understand the behavior of inference systems.


This will take time. The articles in this series will go deep — layer by layer, instruction by instruction, measurement by measurement.

The first article, "The three regimes," laid the foundations: prefill, decode, and question-prefill obey different physical laws.

The next article will go inside the model itself: where time goes, layer by layer.


herbert-rs is open source. Code, benchmarks, and technical discussion are all welcome.

CPU Inference -- ongoing series

  1. CPU Inference #0: understand before optimizing
  2. CPU Inference #1: The three regimes