MXFP4: Revolution or Evolution?
With the release of OpenAI's open-weight GPT-OSS models, in 20B- and 120B-parameter sizes, many have been talking about the MXFP4 format.
Yet several of us were surprised by how efficiently this format is handled, even on machines without dedicated hardware.
Why Does It Work So Well on a Mac Studio?
What is MXFP4? It is a compact floating-point representation: each weight is stored in only 4 bits, with small groups of weights sharing a single scale factor, which drastically reduces the memory footprint.
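Concretely, per the OCP Microscaling (MX) spec, an MXFP4 block holds 32 FP4 (E2M1) values plus one shared 8-bit (E8M0) power-of-two scale. A minimal decoding sketch (the helper name is mine, not a real library API):

```python
# Illustrative MXFP4 decoder, following the OCP Microscaling (MX) spec:
# a block is 32 FP4 (E2M1) values sharing one 8-bit E8M0 scale.

# The 16 representable FP4 E2M1 values (1 sign, 2 exponent, 1 mantissa bit).
FP4_E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
            -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0]

def dequantize_block(codes, scale_byte):
    """Decode one MXFP4 block: 32 4-bit codes plus a shared E8M0 scale."""
    assert len(codes) == 32
    scale = 2.0 ** (scale_byte - 127)  # E8M0: a pure power-of-two exponent
    return [FP4_E2M1[c & 0xF] * scale for c in codes]

# Codes 0..7 with a unit scale (exponent at bias, 127) give the base values.
print(dequantize_block(list(range(8)) * 4, 127)[:8])
# [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
```

Because the scale is a power of two, "dequantizing" is just a table lookup and an exponent shift, which is cheap even without dedicated hardware.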
A quick reminder: the key to LLM performance is memory access speed.
However, high-end consumer GPUs only have 16 to 24 GB of memory, while a 20-billion-parameter model needs roughly 40 GB at 16-bit precision. Compression and optimization are therefore essential...
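A back-of-envelope calculation makes the gap concrete (illustrative arithmetic, not official model sizes; MXFP4 works out to about 4.25 bits per weight once the shared scales are counted):

```python
# Rough memory footprint of a 20B-parameter model at two precisions.
params = 20e9

bf16_gb = params * 16 / 8 / 1e9                 # 16 bits per weight
# MXFP4: 4-bit values plus one 8-bit scale per 32-value block = 4.25 bits.
mxfp4_gb = params * (4 + 8 / 32) / 8 / 1e9

print(f"16-bit: {bf16_gb:.1f} GB, MXFP4: {mxfp4_gb:.1f} GB")
# 16-bit: 40.0 GB, MXFP4: 10.6 GB
```

At around 10.6 GB, the 20B model suddenly fits comfortably in a 16 GB GPU, or in the unified memory of a Mac.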
This format is presented as "lossless compression" for GPT-OSS models, since the weights were released directly in MXFP4 rather than quantized after training. For now, only Nvidia, with its Blackwell architecture, offers hardware capable of natively supporting it.
The Software Implementation
Yet, thanks to a software implementation, these models run perfectly well on machines without this hardware feature; see the discussion in the llama.cpp project.
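What does such a software path involve? Without native support, the runtime must first split each stored byte into two 4-bit codes before applying the block scale. A minimal sketch of that unpacking step (the actual byte layout used by llama.cpp may differ; this only shows the idea):

```python
# Sketch of the software decode path: each byte packs two 4-bit FP4 codes,
# which must be separated with plain bit manipulation.
def unpack_nibbles(packed: bytes) -> list[int]:
    """Return the low then high 4-bit code of each byte."""
    out = []
    for b in packed:
        out.append(b & 0x0F)         # low nibble
        out.append((b >> 4) & 0x0F)  # high nibble
    return out

print(unpack_nibbles(bytes([0x21, 0x43])))  # [1, 2, 3, 4]
```

This is cheap integer work that vectorizes well on CPUs and GPU shaders alike, which helps explain why the lack of native MXFP4 hardware is not crippling.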
On the quality side, there is no impact: the decoded values are exactly what the model was released with. The real question is performance.
Benchmarks on a Mac Studio or an Nvidia RTX 4080 are very encouraging. But what about more recent GPUs with the Blackwell architecture and its native MXFP4 support? At this point, it is hard to say.
The Real Limiting Factor
If dedicated hardware delivers a significant speed gain, then hardware support is truly the key factor. Otherwise, MXFP4 remains a good format, but not necessarily a revolution.
What is clearly an advancement, however, is that this type of number format is being used more and more. This confirms an intuition: the limiting factor is not raw computing power, but rather memory access.
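This intuition can be made quantitative with a roofline-style estimate: generating one token requires streaming every weight through memory once, so throughput is bounded by bandwidth divided by model size. The figures below are rough assumptions on my part, not measurements:

```python
# Roofline-style bound: tokens/s <= memory bandwidth / model size,
# since each generated token must read all weights once.
bandwidth_gb_s = 800.0   # e.g. a Mac Studio (M2 Ultra), approximate
model_gb = 10.6          # ~20B parameters stored in MXFP4

tokens_per_s_bound = bandwidth_gb_s / model_gb
print(f"~{tokens_per_s_bound:.0f} tokens/s upper bound")  # ~75
```

Shrinking the weights by 4x raises this ceiling by 4x on the very same hardware, with no extra compute at all, which is exactly why formats like MXFP4 keep spreading.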