Tile-major layout and transformer memory allocations

When writing a CPU inference engine, two questions keep coming back: how matrices are stored in memory, and how much memory is needed overall. This article covers both: first the tile-major layout used for all projection matrices, then the full memory sizing -- model weights, runtime buffers and KV cache -- for my Qwen3 inference engine.


Tile-major layout

The row-major problem

In standard row-major layout, a matrix [N, K] is stored row by row: the K elements of row 0, then row 1, and so on. For an embedding lookup (reading the row corresponding to a token), this is perfect: K contiguous values in memory.

But for a matrix-vector multiplication (matvec), the situation is different. Computing y = W * x requires, for each output element y[i], the dot product of row i of W with vector x. So far so good. But if you want to vectorize by processing 32 rows simultaneously (which is natural with AVX-512 in f16), you need to read the same column across 32 adjacent rows. In row-major, those 32 values are spaced K elements apart -- strided accesses that destroy cache locality.
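
To make the stride concrete, here is a plain-C sketch of the two row-major access patterns (f16 values shown as uint16_t; the function names are illustrative, not the engine's):

    #include <stddef.h>
    #include <stdint.h>

    /* Row-major [N, K]: element (row, col) lives at W[row * K + col]. */

    /* Embedding lookup: one row = K contiguous f16 values. Ideal. */
    static const uint16_t *row_ptr(const uint16_t *W, int K, int row) {
        return W + (size_t)row * K;
    }

    /* Matvec on 32 rows at once: the same column across rows row0..row0+31
     * touches 32 addresses spaced K * 2 bytes apart -- 32 separate cache lines. */
    static void gather_column(const uint16_t *W, int K, int row0, int col,
                              uint16_t out[32]) {
        for (int r = 0; r < 32; r++)
            out[r] = W[(size_t)(row0 + r) * K + col];  /* strided access */
    }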

The tile-major principle

Tile-major reorganizes a matrix [N, K] as [N/32, K, 32]. Concretely: rows are grouped into tiles of 32 consecutive rows, and within a tile the data is stored column by column, so for each column the 32 values of that column (one per row of the tile) are contiguous in memory. In f16, 32 values x 2 bytes = 64 bytes.

That is exactly one cache line. A single memory access loads the 32 elements needed for a partial dot product across 32 simultaneous rows.
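
A minimal C sketch of that indexing, plus the repacking from row-major (names are mine; N is assumed to be a multiple of 32, which holds for all the shapes below):

    #include <stddef.h>
    #include <stdint.h>

    #define TILE 32  /* 32 f16 values = 64 bytes = one cache line */

    /* Tile-major [N/TILE, K, TILE]: element (row, col) of the logical [N, K]
     * matrix lives at (row / TILE) * K * TILE + col * TILE + (row % TILE). */
    static inline size_t tm_index(int K, int row, int col) {
        return ((size_t)(row / TILE) * K + col) * TILE + (row % TILE);
    }

    /* Repack a row-major [N, K] f16 matrix into tile-major. */
    static void repack_tile_major(const uint16_t *row_major, uint16_t *tile_major,
                                  int N, int K) {
        for (int row = 0; row < N; row++)
            for (int col = 0; col < K; col++)
                tile_major[tm_index(K, row, col)] = row_major[(size_t)row * K + col];
    }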

Why this changes everything

In row-major, processing 32 rows in parallel on a column requires 32 memory accesses (one per row, with a stride of K x 2 bytes). In tile-major, it is a single contiguous 64-byte access. For a [1024, 1024] matrix in f16, you go from 32 potential cache misses to just 1 per column processed. Across thousands of columns and dozens of layers, the gain is massive.
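
Here is a scalar sketch of the resulting matvec loop (float instead of f16 for readability, no intrinsics; an AVX-512 kernel would replace the inner 32-wide loop with a single vector load, a broadcast and an FMA):

    #include <stddef.h>

    /* y = W * x with W stored tile-major [N/32, K, 32]. */
    static void matvec_tile_major(const float *W, const float *x, float *y,
                                  int N, int K) {
        for (int tile = 0; tile < N / 32; tile++) {
            float acc[32] = {0};
            const float *Wt = W + (size_t)tile * K * 32;    /* start of this tile */
            for (int col = 0; col < K; col++) {
                const float *w_col = Wt + (size_t)col * 32; /* 32 contiguous weights: one cache line */
                for (int r = 0; r < 32; r++)
                    acc[r] += w_col[r] * x[col];            /* x[col] broadcast to 32 rows */
            }
            for (int r = 0; r < 32; r++)
                y[tile * 32 + r] = acc[r];
        }
    }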


Per-layer matrices

Each transformer layer stores the following matrices:

Matrix       Dimensions      Size    Layout
q_proj       [1024, 1024]    2 MB    Tile-major
k_proj       [512, 1024]     1 MB    Tile-major
v_proj       [512, 1024]     1 MB    Tile-major
o_proj       [1024, 1024]    2 MB    Tile-major
gate_proj    [3072, 1024]    6 MB    Tile-major
up_proj      [3072, 1024]    6 MB    Tile-major
down_proj    [1024, 3072]    6 MB    Tile-major

Total per layer: ~24 MB of matrices.

The first four (q_proj, k_proj, v_proj, o_proj) are the attention projections. Q and O have the full model dimension (1024) as output, while K and V project to a reduced dimension (512) thanks to Grouped-Query Attention.
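
In terms of head configuration, these shapes work out as follows (the 16 query heads are inferred from 1024 / 64, not stated explicitly; the 8 KV heads of dimension 64 come from the summary at the end):

    enum {
        HIDDEN_DIM = 1024,
        HEAD_DIM   = 64,
        N_Q_HEADS  = HIDDEN_DIM / HEAD_DIM,   /* 16 (inferred) */
        N_KV_HEADS = 8,
        Q_OUT_DIM  = N_Q_HEADS  * HEAD_DIM,   /* 1024 -> q_proj, o_proj: [1024, 1024] */
        KV_OUT_DIM = N_KV_HEADS * HEAD_DIM,   /* 512  -> k_proj, v_proj: [512, 1024]  */
    };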

The next three (gate_proj, up_proj, down_proj) form the feed-forward network (FFN). gate_proj and up_proj expand from 1024 to 3072 (3x factor). down_proj brings it back from 3072 to 1024. gate_proj controls the activation via SiLU before multiplication with the up_proj output.
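
A sketch of the FFN data flow with these shapes, reusing the matvec_tile_major sketch from above (silu and the buffer handling are simplified, not the engine's code):

    #include <math.h>

    static float silu(float v) { return v / (1.0f + expf(-v)); }

    /* FFN: out = down_proj( silu(gate_proj(x)) * up_proj(x) )
     * gate_proj/up_proj: [3072, 1024], down_proj: [1024, 3072]. */
    static void ffn_forward(const float *gate_w, const float *up_w, const float *down_w,
                            const float *x /* [1024] */, float *out /* [1024] */) {
        float gate[3072], up[3072];
        matvec_tile_major(gate_w, x, gate, 3072, 1024);
        matvec_tile_major(up_w,   x, up,   3072, 1024);
        for (int i = 0; i < 3072; i++)
            gate[i] = silu(gate[i]) * up[i];               /* gated activation */
        matvec_tile_major(down_w, gate, out, 1024, 3072);
    }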

All these matrices are stored in tile-major. No exceptions.


Total memory allocations

Model weights

Component                  Calculation            Size
Embeddings (row-major)     151,936 x 1,024 x 2    311 MB
Embeddings (tile-major)    151,936 x 1,024 x 2    311 MB
Final norm                 1,024 x 2              2 KB
28 layers                  28 x 24 MB             672 MB
Total                                             ~1.3 GB

Why two copies of the embeddings? The embedding matrix serves two distinct operations:

  1. Embedding lookup (beginning of the model): read the row corresponding to a token. Row-major is optimal -- a sequential read of 1024 f16 values.
  2. LM head (end of the model): matrix-vector multiplication to project the hidden state to the vocabulary (151,936 classes). Tile-major is optimal -- contiguous accesses in tiles of 32 rows.

Duplicating the matrix costs an extra 311 MB. This is a deliberate memory/performance tradeoff: each layout is optimal for its specific use. The tile-major copy is reorganized from embed_tokens at model loading time.
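
As a sketch of what that load-time step could look like (reusing repack_tile_major from above; names and allocation details are illustrative, not the engine's code):

    #include <stdint.h>
    #include <stdlib.h>

    #define VOCAB_SIZE 151936   /* a multiple of 32, so no tile padding is needed */
    #define MODEL_DIM  1024

    /* Build the tile-major LM-head copy from the row-major embedding table.
     * Both copies hold 151,936 x 1,024 f16 values, ~311 MB each. */
    static uint16_t *build_lm_head_tile_major(const uint16_t *embed_tokens) {
        size_t bytes = (size_t)VOCAB_SIZE * MODEL_DIM * sizeof(uint16_t);
        uint16_t *lm_head = aligned_alloc(64, bytes);   /* 64-byte (cache line) alignment */
        if (!lm_head) return NULL;
        repack_tile_major(embed_tokens, lm_head, VOCAB_SIZE, MODEL_DIM);
        return lm_head;
    }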

Runtime buffers

Decode mode (token-by-token generation):

Buffer               Size
Temporary buffers    ~320 KB
RoPE cache           ~16 KB (for max_seq_len = 32,768)
KV cache             dynamically allocated

Prefill mode (batch prompt processing):

Buffer           Size
Batch buffers    ~12 MB (for MAX_PREFILL_LEN = 512)
KV cache         dynamically allocated

Prefill buffers are larger because up to 512 tokens are processed simultaneously. In decode, only one token is processed at a time, hence much more compact buffers.


KV cache per layer

The KV cache is allocated in chunks of 256 tokens. No per-token reallocation -- a full chunk is allocated when the previous one is full.

Chunk sizing

A chunk covers 256 positions for all attention heads: 256 positions x 8 KV heads x 64 dims x 2 bytes (f16) = 256 KB for the keys, plus the same for the values, i.e. 512 KB per chunk per layer.

Concrete examples

For a sequence of length L:

Sequence            Chunks                KV cache per layer    Total KV cache (28 layers)
5 tokens            ceil(5/256) = 1       512 KB                14 MB
8 tokens (5 + 3)    ceil(8/256) = 1       512 KB                14 MB
300 tokens          ceil(300/256) = 2     1 MB                  28 MB
1024 tokens         ceil(1024/256) = 4    2 MB                  56 MB

With 5 tokens, a single chunk is allocated. Adding 3 tokens (for a total of 8) stays within the same chunk -- no new allocation. A new chunk is only allocated once the sequence exceeds 256 tokens.

Why chunks

Chunk-based allocation in groups of 256 tokens avoids reallocating on every generated token. It is a tradeoff between memory fragmentation (a 512 KB chunk for 5 tokens is wasteful) and allocation cost (reallocating per token would be catastrophic). 256 is a good balance point: large enough to amortize allocations, small enough not to waste too much memory at the end of a chunk.
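
A sketch of the chunked append logic, assuming one chunk list per layer (the structure and names are illustrative, error handling omitted):

    #include <stdint.h>
    #include <stdlib.h>

    #define CHUNK_TOKENS 256
    #define KV_HEADS     8
    #define KV_HEAD_DIM  64
    /* Per position: 8 heads x 64 dims for K plus the same for V = 1,024 f16 values (2 KB). */
    #define POS_ELEMS    (KV_HEADS * KV_HEAD_DIM * 2)
    /* One chunk: 256 positions x 2 KB = 512 KB per layer. */
    #define CHUNK_BYTES  ((size_t)CHUNK_TOKENS * POS_ELEMS * sizeof(uint16_t))

    typedef struct {
        uint16_t **chunks;   /* grows chunk by chunk as the sequence grows */
        int n_chunks;
        int n_tokens;
    } LayerKVCache;

    /* Return the K/V slot for the next position, allocating a new 512 KB chunk
     * only when the previous one is full (or on the very first token). */
    static uint16_t *kv_cache_next_slot(LayerKVCache *c) {
        if (c->n_tokens == c->n_chunks * CHUNK_TOKENS) {
            c->chunks = realloc(c->chunks, (c->n_chunks + 1) * sizeof(*c->chunks));
            c->chunks[c->n_chunks] = malloc(CHUNK_BYTES);
            c->n_chunks++;
        }
        int pos_in_chunk = c->n_tokens % CHUNK_TOKENS;
        c->n_tokens++;
        return c->chunks[c->n_chunks - 1] + (size_t)pos_in_chunk * POS_ELEMS;
    }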


Summary

For an 8-token prompt on Qwen3 (28 layers, 1024 hidden, 8 KV heads x 64 dim):

Component                         Size
Model weights                     ~1.3 GB
Runtime buffers (decode)          ~336 KB
KV cache (28 layers x 1 chunk)    14 MB
Total                             ~1.3 GB

The KV cache is negligible for short sequences. It starts to matter on long contexts (thousands of tokens). At 32,768 tokens (the model's limit), it would reach 28 x 128 x 512 KB = 1.75 GB -- more than the model weights themselves.