Why AI Slows Down as the Conversation Grows
You have probably noticed the phenomenon: at the start of a conversation with an AI, responses come almost instantly. But after about twenty messages, a few attached PDFs, and complex instructions, the wait grows noticeably. This is not a bug. It is a fundamental problem of context and computation.
Context: Everything the AI Keeps in Memory
The context is all the information the model takes into account to generate its response:
- Your current query
- The conversation history
- Attached documents
- System instructions
Some recent models can process up to 1 million tokens of context. But this capability comes at a cost.
More Context, More Latency
An LLM does not read your text the way a human does. Through the attention mechanism, it computes the relationship between every token in the context and every other token.
Consider a concrete example: at the beginning of a conversation, you mention that your cat is named Michel. Fifteen questions later, you ask, "How can I make Michel friendlier?" The model must compare the word "Michel" with every previous token to understand that you are talking about your cat and not a person. And for each word of its response, it performs millions of computations across the entire context.
The result: with standard attention, compute grows roughly quadratically with context length, and latency follows the same curve.
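To put an order of magnitude on this, here is a back-of-envelope sketch in Python. It only counts the two matrix products at the heart of attention, and the d_model value of 4096 is an arbitrary illustrative choice, not a measurement of any particular model.

```python
def attention_flops(n_tokens: int, d_model: int = 4096) -> float:
    """Rough FLOP count for the two core matmuls of one attention layer.

    The Q @ K^T score matrix is n_tokens x n_tokens, so the cost grows
    quadratically with context length.
    """
    scores = 2 * n_tokens * n_tokens * d_model    # Q @ K^T
    weighted = 2 * n_tokens * n_tokens * d_model  # softmax(scores) @ V
    return float(scores + weighted)

for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} tokens -> {attention_flops(n):.2e} FLOPs per layer")
# 10x more context -> roughly 100x more attention compute per layer
```

In practice, engines only pay this full quadratic cost when ingesting the prompt; during generation they reuse previously computed results, which is exactly where the KV cache comes in.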
Caches Help, But Not Always
The KV cache, which stores the Key and Value vectors already computed for earlier tokens so they do not have to be recomputed at every step, is supposed to mitigate this problem. But in practice, several factors limit its effectiveness:
Memory Purging
On shared infrastructure (like cloud services), the cache is regularly cleared to free up RAM or VRAM for other users. Your conversation then loses its cache, and the model must recompute everything.
Partial Content
In a "think then answer" format (Think > Answer), only the final answer is often kept in the cache. The reasoning phase, which can be lengthy, is discarded, making the cache partially invalid for subsequent exchanges.
No Persistence Between Sessions
Few services maintain the cache from one session to another. With each new connection, the model starts from scratch, even if your previous conversation was long and detailed.
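To make concrete what is lost when the cache is purged, dropped, or never persisted, here is a toy single-head sketch in NumPy. The dimensions and projection matrices are arbitrary stand-ins; real engines cache Keys and Values per layer and per head.

```python
import numpy as np

d = 64                               # toy head dimension
rng = np.random.default_rng(0)
W_k = rng.normal(size=(d, d))        # stand-in Key projection
W_v = rng.normal(size=(d, d))        # stand-in Value projection

def project_kv(x):
    """Project token embeddings of shape (n, d) into Keys and Values."""
    return x @ W_k, x @ W_v

# Without a cache: every decoding step re-projects the whole context.
def decode_step_no_cache(context):
    return project_kv(context)       # O(context length) work at every step

# With a cache: each step projects only the newest token and appends it.
kv_cache = {"k": np.empty((0, d)), "v": np.empty((0, d))}

def decode_step_with_cache(new_token):
    k_new, v_new = project_kv(new_token[None, :])   # O(1) work per step
    kv_cache["k"] = np.vstack([kv_cache["k"], k_new])
    kv_cache["v"] = np.vstack([kv_cache["v"], v_new])
    return kv_cache["k"], kv_cache["v"]
```

When the cache is evicted or never persisted, the next request falls back to the no-cache path for the entire prompt, which is why latency suddenly jumps back up.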
Implications for Production Systems
These constraints are crucial for anyone building systems based on LLMs:
- Variable latency: response time is not constant; it increases with context. This must be accounted for in user experience design.
- Memory management: the longer the context, the more GPU memory is required. For a model like Llama3-70B, the KV cache can reach 10.5 GB for 4k tokens (see the back-of-envelope estimate after this list).
- Inference cost: each token of context has a computational cost. Long conversations are significantly more expensive to process.
- Stability: response quality can degrade over very long contexts, even if the model is technically capable of processing them.
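The 10.5 GB figure above can be reproduced with a simple formula. The configuration below (80 layers, 64 heads of dimension 128, fp16) is an illustrative assumption for a 70B-class model without grouped-query attention; models that share KV heads across query heads need far less.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, dtype_bytes=2, batch=1):
    """KV cache size = 2 (K and V) x layers x KV heads x head_dim x tokens x bytes."""
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * dtype_bytes * batch

# 70B-class model, fp16, 4k tokens of context, no grouped-query attention:
print(kv_cache_bytes(80, 64, 128, 4096) / 1e9)   # ~10.7 GB, in line with the figure above
# The same model with 8 shared KV heads (grouped-query attention):
print(kv_cache_bytes(80, 8, 128, 4096) / 1e9)    # ~1.3 GB
```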
Mitigation Strategies
Several approaches can limit the impact:
- Context compression: summarizing previous exchanges to reduce the number of tokens without losing essential information.
- Sliding window: keeping only the last N messages, sacrificing older information (a minimal sketch follows this list).
- RAG (Retrieval-Augmented Generation): storing information in a vector database and loading only relevant passages for each query.
- KV cache optimization: techniques like PagedAttention or offloading to better manage memory.
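As a minimal sketch of the sliding-window strategy: the function below keeps the system prompt plus the most recent messages that fit a token budget. The word-based token count is a placeholder; a real system would use the model's tokenizer.

```python
def sliding_window(messages, max_tokens, count_tokens=lambda m: len(m.split())):
    """Keep the system prompt plus the newest messages that fit the budget."""
    system, history = messages[0], messages[1:]
    budget = max_tokens - count_tokens(system)
    kept = []
    for msg in reversed(history):        # walk from newest to oldest
        cost = count_tokens(msg)
        if cost > budget:
            break                        # everything older is dropped
        kept.append(msg)
        budget -= cost
    return [system] + list(reversed(kept))

trimmed = sliding_window(
    ["You are a helpful assistant."] + [f"message {i}" for i in range(50)],
    max_tokens=40,
)
print(trimmed)   # system prompt + only the most recent messages
```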
Conclusion
LLMs can process enormous contexts, but not quickly, not without memory constraints, and not always stably. Understanding these limits is essential for designing reliable, performant, and scalable systems. The context window size advertised by a model is only part of the story: what truly matters is how that window is managed under real-world conditions.