How Does Multiplication Work at Tenstorrent?

While optimizing my inference engine, I need to focus on an operation performed several billion times per token: multiplication.

Tenstorrent is a Canadian company specializing in high-performance AI processors. It was founded by Ljubisa Bajic and Jim Keller (whom I greatly admire: he has worked at Apple, AMD, and Tesla, no less).

Their approach to multiplication is truly original.

Refresher: the IEEE754 Standard

The IEEE754 standard defines how to represent a floating-point number in 3 parts:

- the sign (positive or negative)
- the exponent (the order of magnitude)
- the mantissa (the significant digits)

For example, a 32-bit float uses 1 sign bit, 8 exponent bits, and 23 mantissa bits.

Example: 0_10000001_01000000000000000000000 (sign +, exponent 129 - 127 = 2, mantissa 1.25: this encodes 1.25 x 2^2 = 5.0)
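
To make this concrete, here is a short Python snippet (my own illustration, not Tenstorrent code) that decodes this exact bit pattern by hand, then checks the result against Python's own float decoding:

```python
import struct

bits = 0b0_10000001_01000000000000000000000

sign     = bits >> 31              # 0 -> positive
exponent = (bits >> 23) & 0xFF     # 0b10000001 = 129; unbiased: 129 - 127 = 2
fraction = bits & 0x7FFFFF         # 0b0100... -> 0.25 once divided by 2**23

# value = (-1)^sign * (1 + fraction) * 2^(exponent - bias), bias = 127
value = (-1) ** sign * (1 + fraction / 2**23) * 2 ** (exponent - 127)
print(value)                                            # 5.0
print(struct.unpack(">f", bits.to_bytes(4, "big"))[0])  # 5.0, same 4 bytes
```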

So Where Does Tenstorrent Fit In?

Traditionally, an FPU multiplies two floats by directly manipulating their full mantissas: precise, but expensive in transistors, and therefore in energy.

At Tenstorrent, there is no direct floating-point multiplication. Instead, there is a very small 7-bit x 5-bit integer multiplier: the mantissas are split into blocks (7 and 5 bits), multiplied separately, then recombined with shifts and additions, as the sketch below illustrates.
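
Here is a minimal Python sketch of the idea. The 7-bit and 5-bit chunk widths come from above; the operand widths (14 and 10 bits) are my own assumption, chosen so that each mantissa splits into exactly two chunks:

```python
def mul_7x5(x, y):
    """The only multiplier we allow ourselves: 7-bit x 5-bit."""
    assert x < (1 << 7) and y < (1 << 5)
    return x * y

def chunked_multiply(ma, mb):
    a_hi, a_lo = ma >> 7, ma & 0x7F   # A = a_hi * 2^7 + a_lo  (two 7-bit chunks)
    b_hi, b_lo = mb >> 5, mb & 0x1F   # B = b_hi * 2^5 + b_lo  (two 5-bit chunks)
    return ((mul_7x5(a_hi, b_hi) << 12)   # high x high
          + (mul_7x5(a_hi, b_lo) << 7)    # high x low
          + (mul_7x5(a_lo, b_hi) << 5)    # low  x high
          +  mul_7x5(a_lo, b_lo))         # low  x low

ma, mb = 11482, 717                        # arbitrary 14-bit and 10-bit mantissas
assert chunked_multiply(ma, mb) == ma * mb  # all four passes -> exact product
```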

In doing so, they have the ability to adjust the precision of computations: Math Fidelity.

This is a slider that lets you choose between several precision levels, from a single multiplier pass (fastest, least precise) to all the passes needed for a full-precision product (slowest, most exact).

Exactly like a long multiplication where you choose to stop after 1, 2, or 3 steps depending on the speed / precision trade-off.
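
Extending the sketch above: the fidelity names below (LoFi, HiFi2, HiFi3, HiFi4) are the ones Tenstorrent's software stack exposes, but the exact mapping of passes to partial products here is my illustrative assumption:

```python
def mantissa_multiply(ma, mb, passes):
    """Stop after 1..4 partial products: fewer passes = faster, less precise."""
    a_hi, a_lo = ma >> 7, ma & 0x7F
    b_hi, b_lo = mb >> 5, mb & 0x1F
    partials = [                 # ordered roughly most to least significant
        (a_hi * b_hi) << 12,
        (a_hi * b_lo) << 7,
        (a_lo * b_hi) << 5,
        (a_lo * b_lo),
    ]
    return sum(partials[:passes])

ma, mb = 11482, 717
exact = ma * mb                  # 8232594
for passes, name in enumerate(["LoFi", "HiFi2", "HiFi3", "HiFi4"], start=1):
    approx = mantissa_multiply(ma, mb, passes)
    print(f"{name}: {approx} (error {exact - approx})")
    # LoFi:  8019968 (error 212626)  ~2.6% off
    # HiFi2: 8168064 (error 64530)
    # HiFi3: 8231424 (error 1170)
    # HiFi4: 8232594 (error 0)       exact
```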

Why is this clever?

A tiny integer multiplier costs far fewer transistors, and therefore far less energy, than a full floating-point multiplier, and precision becomes a software decision instead of a hardware constant: each workload pays only for the accuracy it actually needs.

And this fits perfectly with the evolution of current models (e.g., 4-bit quantization built into LLM design from the start), as the sketch below suggests.
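
As an aside, here is what that kind of quantization looks like in its simplest form (symmetric, per-tensor); this is a generic sketch, not any specific LLM's scheme:

```python
import numpy as np

def quantize_4bit(w):
    """Map float weights to ints in [-7, 7] plus a single float scale."""
    scale = max(np.abs(w).max(), 1e-12) / 7.0
    q = np.clip(np.round(w / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(8).astype(np.float32)
q, s = quantize_4bit(w)
print(w)
print(dequantize(q, s))  # close to w, yet each weight fits in 4 bits
```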

In short: doing floating-point with small integers -- and letting the software choose the level of precision.


Small Language Models (SLMs): The Future of Agentic AI?

An article published in June 2025 by researchers at Nvidia aligns with our vision of AI: "Small Language Models are the Future of Agentic AI" (arXiv:2506.02153v1).

They argue that SLMs -- more compact, more energy-efficient, and more accessible than LLMs -- are particularly well suited to specific, structured, and repetitive tasks... in other words: ideal for AI agents.

What if intelligence were no longer measured by the size of the model, but by its contextual efficiency?