CPU Inference #0: understand before optimizing

I wrote herbert-rs, a local LLM inference engine in Rust. Not a wrapper, not a binding: every matrix multiplication, every attention kernel, every routine is implemented and measured down to the cycle, in x86-64 and ARM64 assembly and in Metal and Vulkan compute shaders, with hardware performance counters (PMCs) validating every decision.

This project rests on a few simple principles:

- Modern inference systems are governed by a large number of variables.
- Every optimization must be preceded and followed by a rigorous measurement campaign; no conclusion is drawn without one.
- Regressions and negative results must be preserved and explained.

The goal is not merely to optimize an engine, but to empirically understand the behavior of inference systems.


This will take time. The articles in this series will go deep — layer by layer, instruction by instruction, measurement by measurement.

The first article, "The three regimes," laid the foundations: prefill, decode, and question-prefill obey different physical laws.

The next article will go inside the model itself: where time goes, layer by layer.


herbert-rs is open source. Code, benchmarks, and technical discussion are all welcome.

CPU Inference -- ongoing series

  1. CPU Inference #0: understand before optimizing
  2. CPU Inference #1: The three regimes