How Does Multiplication Work at Tenstorrent?

While optimizing my inference engine, I need to focus on an operation performed several billion times per token: multiplication.

Tenstorrent is a Canadian company specializing in high-performance AI processors. It was founded by Ljubisa Bajic and Jim Keller (whom I greatly admire: he has worked at Apple, AMD, and Tesla, no less).

Their approach to multiplication is truly original.

Refresher: the IEEE754 Standard

The IEEE754 standard defines how to represent a floating-point number in 3 parts:

- the sign (positive or negative)
- the exponent (the order of magnitude)
- the mantissa (the significant digits)

For example, a 32-bit float uses 1 sign bit, 8 exponent bits, and 23 mantissa bits.

Example: 0_10000001_01000000000000000000000 (sign +, exponent 129 - 127 = 2, mantissa 1.25: this encodes 1.25 x 2^2 = 5.0)
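
To make this concrete, here is a short Python snippet (my own illustration, not Tenstorrent code) that decodes this exact bit pattern by hand, then checks the result against Python's own float decoding:

```python
import struct

bits = 0b0_10000001_01000000000000000000000

sign     = bits >> 31              # 0 -> positive
exponent = (bits >> 23) & 0xFF     # 0b10000001 = 129; unbiased: 129 - 127 = 2
fraction = bits & 0x7FFFFF         # 0b0100... -> 0.25 once divided by 2**23

# value = (-1)^sign * (1 + fraction) * 2^(exponent - bias), bias = 127
value = (-1) ** sign * (1 + fraction / 2**23) * 2 ** (exponent - 127)
print(value)                                            # 5.0
print(struct.unpack(">f", bits.to_bytes(4, "big"))[0])  # 5.0, same 4 bytes
```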

So Where Does Tenstorrent Fit In?

Traditionally, an FPU multiplies two floats by directly manipulating their full mantissas: precise, but expensive in transistors, and therefore in energy.

At Tenstorrent, there is no direct floating-point multiplication. Instead, there is a very small 7-bit x 5-bit integer multiplier: the mantissas are split into blocks (7 and 5 bits), multiplied separately, then recombined with shifts and additions, as the sketch below illustrates.
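
Here is a minimal Python sketch of the idea. The 7-bit and 5-bit chunk widths come from above; the operand widths (14 and 10 bits) are my own assumption, chosen so that each mantissa splits into exactly two chunks:

```python
def mul_7x5(x, y):
    """The only multiplier we allow ourselves: 7-bit x 5-bit."""
    assert x < (1 << 7) and y < (1 << 5)
    return x * y

def chunked_multiply(ma, mb):
    a_hi, a_lo = ma >> 7, ma & 0x7F   # A = a_hi * 2^7 + a_lo  (two 7-bit chunks)
    b_hi, b_lo = mb >> 5, mb & 0x1F   # B = b_hi * 2^5 + b_lo  (two 5-bit chunks)
    return ((mul_7x5(a_hi, b_hi) << 12)   # high x high
          + (mul_7x5(a_hi, b_lo) << 7)    # high x low
          + (mul_7x5(a_lo, b_hi) << 5)    # low  x high
          +  mul_7x5(a_lo, b_lo))         # low  x low

ma, mb = 11482, 717                        # arbitrary 14-bit and 10-bit mantissas
assert chunked_multiply(ma, mb) == ma * mb  # all four passes -> exact product
```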

In doing so, they have the ability to adjust the precision of computations: Math Fidelity.

This is a slider that lets you choose between several precision levels, from a single multiplier pass (fastest, least precise) to all the passes needed for a full-precision product (slowest, most exact).

Exactly like a long multiplication where you choose to stop after 1, 2, or 3 steps depending on the speed / precision trade-off.
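
Extending the sketch above: the fidelity names below (LoFi, HiFi2, HiFi3, HiFi4) are the ones Tenstorrent's software stack exposes, but the exact mapping of passes to partial products here is my illustrative assumption:

```python
def mantissa_multiply(ma, mb, passes):
    """Stop after 1..4 partial products: fewer passes = faster, less precise."""
    a_hi, a_lo = ma >> 7, ma & 0x7F
    b_hi, b_lo = mb >> 5, mb & 0x1F
    partials = [                 # ordered roughly most to least significant
        (a_hi * b_hi) << 12,
        (a_hi * b_lo) << 7,
        (a_lo * b_hi) << 5,
        (a_lo * b_lo),
    ]
    return sum(partials[:passes])

ma, mb = 11482, 717
exact = ma * mb                  # 8232594
for passes, name in enumerate(["LoFi", "HiFi2", "HiFi3", "HiFi4"], start=1):
    approx = mantissa_multiply(ma, mb, passes)
    print(f"{name}: {approx} (error {exact - approx})")
    # LoFi:  8019968 (error 212626)  ~2.6% off
    # HiFi2: 8168064 (error 64530)
    # HiFi3: 8231424 (error 1170)
    # HiFi4: 8232594 (error 0)       exact
```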

Why is this clever?

A tiny integer multiplier costs far fewer transistors, and therefore far less energy, than a full floating-point multiplier, and precision becomes a software decision instead of a hardware constant: each workload pays only for the accuracy it actually needs.

And this fits perfectly with the evolution of current models (e.g., 4-bit quantization built into LLM design from the start), as the sketch below suggests.
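
As an aside, here is what that kind of quantization looks like in its simplest form (symmetric, per-tensor); this is a generic sketch, not any specific LLM's scheme:

```python
import numpy as np

def quantize_4bit(w):
    """Map float weights to ints in [-7, 7] plus a single float scale."""
    scale = max(np.abs(w).max(), 1e-12) / 7.0
    q = np.clip(np.round(w / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(8).astype(np.float32)
q, s = quantize_4bit(w)
print(w)
print(dequantize(q, s))  # close to w, yet each weight fits in 4 bits
```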

In short: doing floating-point with small integers -- and letting the software choose the level of precision.


Small Language Models (SLMs): The Future of Agentic AI?

An article published in June 2025 by researchers at Nvidia aligns with our vision of AI: "Small Language Models are the Future of Agentic AI" (arXiv:2506.02153v1).

They argue that SLMs -- more compact, more energy-efficient, and more accessible than LLMs -- are particularly well suited to specific, structured, and repetitive tasks... in other words: ideal for AI agents.

What if intelligence were no longer measured by the size of the model, but by its contextual efficiency?