Why AI Slows Down as the Conversation Grows

You have probably noticed the phenomenon: at the start of a conversation with an AI, responses come almost instantly. But after about twenty messages, a few attached PDFs, and complex instructions, the wait grows noticeably. This is not a bug. It is a fundamental problem of context and computation.

Context: Everything the AI Keeps in Memory

The context is all the information the model takes into account to generate its response:

- the system prompt and any custom instructions,
- the full history of the conversation (your messages and the model's replies),
- any attached documents, such as PDFs or code files.

Some recent models can process up to 1 million tokens of context. But this capability comes at a cost.

More Context, More Latency

AI does not read your text the way a human does. It computes the relationships between every pair of tokens in the context through the attention mechanism.

Consider a concrete example: at the beginning of a conversation, you mention that your cat is named Michel. Fifteen questions later, you ask "How can I make Michel friendlier?". The model must compare the word "Michel" with all previous words to understand that you are talking about your cat and not a person. And for each word in its response, it performs millions of computations across the entire context.

The result: the cost of attention grows quadratically with context length, and latency follows the same curve.
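To make the quadratic growth concrete, here is a toy sketch (the function name is illustrative, and each "computation" stands in for a real dot product between query and key vectors) counting the pairwise work attention does over a prompt:

```python
def naive_attention_cost(n_tokens):
    """Count the pairwise score computations needed to process a
    prompt of n_tokens with full attention: every token's query is
    scored against every token's key."""
    comparisons = 0
    for query in range(n_tokens):
        for key in range(n_tokens):
            comparisons += 1  # one q·k score per (query, key) pair
    return comparisons

print(naive_attention_cost(100))    # 10000
print(naive_attention_cost(1000))   # 1000000: 10x the context, 100x the work
```

Multiplying the context by ten multiplies the work by a hundred, which is why a long conversation feels so much slower than a fresh one.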

Caches Help, But Not Always

The KV cache, which stores already-computed Key and Value vectors, is supposed to mitigate this problem. But in practice, several factors limit its effectiveness:

Memory Purging

On shared infrastructure (like cloud services), the cache is regularly cleared to free up RAM or VRAM for other users. Your conversation then loses its cache, and the model must recompute everything.

Partial Content

In a "think then answer" format (Think > Answer), often only the final answer is kept in the cache. The reasoning phase, which can be lengthy, is discarded, making the cache partially invalid for subsequent exchanges.

No Persistence Between Sessions

Few services maintain the cache from one session to another. With each new connection, the model starts from scratch, even if your previous conversation was long and detailed.
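Taken together, the caching behavior above can be sketched in a few lines. This is a deliberately simplified model (the `compute_kv` stand-in and the dictionary cache are illustrative, not any framework's real API), but it shows what a cache hit saves and what a purge costs:

```python
def compute_kv(token):
    # Stand-in for the real per-token Key/Value projection.
    return (hash(token) % 97, hash(token) % 89)

def generate_step(context, cache):
    """Return how many K/V computations this step required.
    Cached positions are reused; missing ones are recomputed."""
    work = 0
    for position, token in enumerate(context):
        if position not in cache:
            cache[position] = compute_kv(token)
            work += 1  # cache miss: recompute this token's K/V
    return work

cache = {}
tokens = ["the", "cat", "is", "named", "Michel"]
first = generate_step(tokens, cache)           # 5: cold cache, full prefill
second = generate_step(tokens + ["!"], cache)  # 1: only the new token
cache.clear()                                  # a purge on shared infra
third = generate_step(tokens + ["!"], cache)   # 6: everything recomputed
print(first, second, third)
```

With a warm cache, each new exchange only pays for the new tokens; after a purge or a new session, the model is back to recomputing the entire prefix.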

Implications for Production Systems

These constraints are crucial for anyone building systems based on LLMs:

- latency budgets must account for context that grows with every exchange,
- cache hits cannot be relied upon on shared infrastructure,
- token costs and response times scale with everything kept in the prompt.

Mitigation Strategies

Several approaches can limit the impact:

- summarize or trim older exchanges instead of resending the full history,
- keep stable content (system prompt, reference documents) at the start of the prompt so cached prefixes remain valid,
- send only the documents relevant to the current question rather than the whole corpus,
- split very long workflows into shorter sessions with explicit state.
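A trimming strategy can be sketched as follows. This is a minimal illustration, not a production implementation: the word count is a crude stand-in for a real tokenizer, and the message list and budget are hypothetical.

```python
def trim_history(messages, max_tokens):
    """Keep the most recent messages that fit the token budget,
    always preserving the first message (typically the system prompt)."""
    def size(msg):
        return len(msg.split())  # crude stand-in for real tokenization

    system, rest = messages[0], messages[1:]
    budget = max_tokens - size(system)
    kept = []
    for msg in reversed(rest):  # walk from newest to oldest
        if size(msg) > budget:
            break  # stop once the next-oldest message no longer fits
        kept.append(msg)
        budget -= size(msg)
    return [system] + list(reversed(kept))

history = [
    "You are a helpful assistant.",
    "My cat is named Michel.",
    "He scratches the sofa a lot.",
    "How can I make Michel friendlier?",
]
print(trim_history(history, max_tokens=12))
```

The trade-off is real: trimming keeps latency flat, but the model loses anything outside the window (here it would no longer know who Michel is), which is why summarization is often combined with trimming.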

Conclusion

LLMs can process enormous contexts, but they do not do so quickly, nor without memory constraints, nor in a stable manner. Understanding these limits is essential for designing reliable, performant, and scalable systems. The context window size advertised by a model is only part of the story: what truly matters is how that window is managed under real-world conditions.


Questions about this article or your own project? Book a consultation