The model that drops digits: picking an LLM architecture per agentic link
A few days ago I clicked a link my local assistant had produced. 404. The URL pointed at a tweet: https://x.com/karpathy/status/204990382105354523. The real tweet was one character away: https://x.com/karpathy/status/2049903821095354523. The difference: eighteen digits in the first ID, nineteen in the second. One digit had vanished from the middle of the identifier.
The assistant is mAIstrow, the desktop client I use every day, running on my own inference engine (herbert-rs) with Granite-4.1-30B in Q4 as the backbone. The web_fetch tool received the truncated URL and did exactly what it should have done: it called the URL and reported a 404.
First reaction: it's my code. I'm writing an inference engine from scratch, I have sampling, tokenizer and KV cache code under permanent debug pressure. A single digit lost in the middle of a string smells like a pointer error or a miswired sampler. I opened the bug bundle, looked at the trace: the model's response had already arrived truncated on the mAIstrow side. The sampler had nothing to apologize for. The model, at the end of generation, had literally written eighteen digits instead of nineteen.
Before the bench, a word on the title. When I say "agentic AI system," I mean a system that chains multiple calls to the model for a single user request (extracting an argument, calling a tool, synthesizing a result, following a format), not an orchestration of several agents collaborating. The article defends a single idea: the model architecture that performs best is not the same from one link to the next. What broke in mAIstrow was precisely the verbatim-copy link.
Minimal reproduction
To take herbert-rs out of the equation, I replayed the exact same prompt under llama-server (llama.cpp build b8680), with the same Q4_K_M quant of Granite-4.1-30B and temperature=0. Same answer. One digit short, in the same place.
So it isn't my code. It isn't stochastic sampling either: at temperature=0, the model is deterministic on this prompt. If I rerun it, I get the same wrong output. Every time.
The prompt itself is explicit to a level that's almost insulting to the model:
```
Extract the numeric status ID from this URL and return it as JSON in the form
{"id": "<digits>"}. Copy the digits EXACTLY: do not abbreviate, round, or omit
any digit. Respond with the JSON only, no prose.
```
The URL sits in the context. Three tokens above. The model just has to point at it. It misses.
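The two IDs from that 404, lined up character by character, show exactly where the copy broke. A two-line check anyone can rerun:

```python
expected = "2049903821095354523"  # the real status ID (19 digits)
got = "204990382105354523"        # what the model emitted (18 digits)

# index of the first character where the two strings diverge
i = next(k for k, (a, b) in enumerate(zip(expected, got)) if a != b)
print(i, expected[i], got[i])  # → 11 9 5: the '9' at position 11 is gone
```

Everything after position 11 still matches once the missing '9' is reinserted (`expected[12:] == got[11:]`), which is what makes this look like a single dropped digit rather than noise.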
This isn't a knowledge hallucination: it's a copy mistake on a string that's already in the context. That category of error has a name and a literature.
The bench
I wrote a minimal bench in fifty lines of Python. Five lengths (8, 12, 16, 19, 22 digits), twenty random identifiers per length, the same prompt, a single user turn, temperature=0. Character-by-character comparison. And to avoid staying purely synthetic, I also took twenty-one real public tweet URLs (@tenstorrent, @elonmusk, @ggerganov, @karpathy) that anyone can click on to verify the expected ID.
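The core of that bench fits in a sketch like the one below. The ID generation and character-by-character comparison are the actual method; the endpoint URL and port are assumptions (llama.cpp's OpenAI-compatible `/v1/chat/completions` route on localhost):

```python
import json
import random
import re
import urllib.request

PROMPT = (
    'Extract the numeric status ID from this URL and return it as JSON in the '
    'form {"id": "<digits>"}. Copy the digits EXACTLY: do not abbreviate, '
    'round, or omit any digit. Respond with the JSON only, no prose.\n'
    "https://x.com/user/status/{id}"
)

def make_ids(length, n=20, seed=0):
    """Generate n random identifiers of the given digit length."""
    rng = random.Random(seed)
    return ["".join(rng.choice("0123456789") for _ in range(length))
            for _ in range(n)]

def first_diff(expected, got):
    """Index of the first differing character, or -1 if identical."""
    for i, (a, b) in enumerate(zip(expected, got)):
        if a != b:
            return i
    return -1 if len(expected) == len(got) else min(len(expected), len(got))

def query(url_id, endpoint="http://localhost:8080/v1/chat/completions"):
    """One deterministic completion at temperature=0; returns the copied ID."""
    body = json.dumps({
        "messages": [{"role": "user",
                      "content": PROMPT.replace("{id}", url_id)}],
        "temperature": 0,
    }).encode()
    req = urllib.request.Request(endpoint, body,
                                 {"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        content = json.load(resp)["choices"][0]["message"]["content"]
    match = re.search(r'"id"\s*:\s*"(\d+)"', content)
    return match.group(1) if match else ""
```

Usage: for each length in (8, 12, 16, 19, 22), run `query` over `make_ids(length)` and count how often `first_diff` is not -1.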
Here is what comes out across nine models, all going through the same llama-server backend:
| Model | Architecture | Quant | Error (real URLs) |
|---|---|---|---|
| granite-4.1-30B | Mamba2-hybrid (36/4) | Q4_K_M | 24% (5/21) |
| granite-4.1-8B | Mamba2-hybrid | BF16 | 88% |
| granite-4.0-1B | Dense Transformer | BF16 | 0% |
| Qwen3-VL-2B | Dense Transformer | Q4_K_M | 0% |
| Qwen3.6-27B | DeltaNet hybrid 3:1 | Q4_K_M | 0% |
| LFM2-2.6B | Shortconv hybrid (~33% attn) | Q4_K_M | 0% |
| gpt-oss-20b | Dense Transformer (MoE) | F16 | 0% |
| gemma-4-26B-A4B | Dense Transformer (MoE) | Q4_K_M | 0% |
| SmolLM3-3B | Dense Transformer | Q4_K_M | 0% |
The pattern is impossible to miss. Every Granite-4-H variant (the vanilla Mamba2 hybrids) fails. Everything else passes, regardless of size (1B to 27B), quantization (Q4, F16 or BF16) or family.
The line that really isolates the variable is Granite-4.0-1B: the smallest model in the same family, in BF16, a classic dense Transformer (40 attention layers, zero Mamba). One hundred percent correct on the bench. Same tokenizer, same training pipeline, same team. The only difference: architecture.
On synthetic lengths, the Granite-4.1-30B profile follows a regular curve: 10% error at 8 digits, 50% at 16, 55% at 19, 30% at 22. Granite-4.1-8B is even worse, up to 100% on certain lengths. The smaller model in full precision misses more often than the larger model in Q4. Quantization isn't the problem.
Four falsified hypotheses
Before landing on "it's the architecture", I went through four hypotheses. None of them survived ten minutes of bench, but each was plausible when I posed it.
The tokenizer. Granite uses a numeric BPE in chunks of 1 to 3 digits (regex \p{N}{1,3}), like GPT-3.5 and GPT-4. Llama-3, Qwen3 and Gemma-4 sit on single-digit splitting (\p{N}). We've known since Singh and Strouse 2024 (arXiv:2402.14903) that this choice degrades arithmetic. Hypothesis: it also degrades copy. Falsification: gpt-oss-20b uses exactly the same regex (verified in its tokenizer.json) and copies 100/100. The tokenizer alone doesn't explain error rates of 24 to 88%.
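The splitting difference is easy to visualize. The regexes below are simplified stand-ins (`\d` instead of `\p{N}`, which Python's stdlib `re` doesn't support, and outside the full pre-tokenizer pattern), applied to the exact ID from the 404 story:

```python
import re

# 1-3 digit chunking (Granite, GPT-3.5/4, gpt-oss style)
def chunk_3(s):
    return re.findall(r"\d{1,3}", s)

# single-digit chunking (Llama-3, Qwen3, Gemma style)
def chunk_1(s):
    return re.findall(r"\d", s)

tweet_id = "2049903821095354523"
print(chunk_3(tweet_id))  # ['204', '990', '382', '109', '535', '452', '3']
print(len(chunk_1(tweet_id)))  # 19: one token per digit
```

With 1-3 digit chunking, losing one token means losing up to three digits at once; the bug lost exactly one digit, which already hinted the tokenizer wasn't the whole story.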
Quantization. Q4_K_M is aggressive. Maybe the dequantized weights lose too much precision to carry an exact nineteen-digit string. Falsification: Granite-4.1-8B in BF16, so without any quantization at all, does worse (88% on real URLs) than Granite-4.1-30B in Q4 (24%). The smaller model in full precision misses more than the larger one in Q4.
Size. Maybe Granite is just a mid-tier model that's a bit weak on this task and 30B isn't enough capacity. Falsification: Granite-4.0-1B, the smallest in the family, lands at 0%. And Qwen3-VL-2B, a model of comparable size from another family, also lands at 0%. Size isn't the variable.
Training. Maybe the Granite family was tuned for instruction and RAG tasks, not for verbatim copy. Partial falsification: Granite-4.0-1B, with the exact same training pipeline, copies perfectly. So the family knows how to copy. It's the hybrid variants that can't.
There's one variable left I hadn't isolated: the architecture. Granite-4.0-1B is a pure dense Transformer. Granite-4.1-8B and 30B are Mamba2-hybrids with a ratio of 36 Mamba2 layers for 4 attention layers.
The theorem that predicts the bug
At this point I figured there had to be literature. There is. Three papers converge.
Jelassi, Brandfonbrener, Kakade, Malach (Harvard / Kempner Institute, ICML 2024). Repeat After Me: Transformers are Better than State Space Models at Copying (arXiv:2402.01032). The headline result is formal and strong:
"We prove two results. First, we show that a small Transformer can be used to copy extremely long sequences. Second, we show that any language model with fixed-size memory (i.e., any SSM) fails to copy long random strings."
Their evaluation task is called phonebook retrieval: the model is given a phonebook and asked for someone's number. That is exactly the task I was running: extract a long numeric identifier from a context. The 88% error of Granite-4.1-8B is the empirical realization of their theorem.
The intuition is information-theoretic, not implementation-bound.
A Transformer's attention allocates memory of size O(n × d): every token in the context gets its own slot. Copying an identifier three tokens above is just pointing at the right slot.
A Mamba-style SSM has a fixed-size latent state (d_state=256 for Granite-4). Compressing a random nineteen-digit string into that state and then emitting it back bit-perfect runs into the pigeonhole principle: the state has to hold the whole string exactly, for any string the input distribution can produce. The model can learn to allocate its state precisely for in-distribution cases, but that learning doesn't generalize. You don't fix this with better C++. It's a property of the architecture.
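A back-of-envelope number makes the asymmetry concrete (an illustration, not the paper's proof):

```python
import math

def digits_entropy_bits(n: int) -> float:
    """A uniformly random n-digit string carries n * log2(10) bits."""
    return n * math.log2(10)

# 19 digits ≈ 63 bits. Tiny in absolute terms -- but an SSM must have
# *learned* to reserve that capacity in its fixed state for an arbitrary
# position in the context, while attention simply re-reads the tokens
# from its O(n x d) cache at generation time.
print(round(digits_entropy_bits(19), 1))  # 63.1
```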
Yang, Kautz, Hatamizadeh (Dec 2024). Gated Delta Networks: Improving Mamba2 with Delta Rule (arXiv:2412.06464). This paper draws an important distinction: vanilla Mamba2 is known to be weak in retrieval. There's a variant (Gated DeltaNet) that patches the gap by combining a gating mechanism with a delta update rule. Qwen3-Next, Qwen3.5 and Qwen3.6 use Gated DeltaNet at a 3:1 ratio (75% DeltaNet, 25% attention). My bench confirms it: Qwen3.6-27B copies without errors.
And LFM2 from LiquidAI takes a third path: it interleaves its attention layers with short causal convolutions (shortconv, l_cache=3, so each conv only sees the last three tokens). These conv-layers don't try to do retrieval, they're too short for that. The attention layers (33% of the model) handle it. Result: 0% error on the same bench.
The reading that fits the four data points:
| Architecture | Non-attention component | Attempts retrieval? | Bench |
|---|---|---|---|
| Dense Transformer | (none) | (n/a) | ✅ |
| LFM2 (LiquidAI) | shortconv l_cache=3 | No, too short | ✅ |
| Qwen3-Next / 3.5 / 3.6 | Gated DeltaNet | Yes, designed for it | ✅ |
| Granite-4-H | Vanilla Mamba2 | Yes, but can't | ❌ |
"Hybrid = bad" is too coarse. The right summary: if the non-attention component tries to absorb retrieval, it had better be designed for it (DeltaNet); otherwise keep it shallow (LFM2) and let attention do its job. Granite-4-H falls into the gap: vanilla Mamba2 (the weakest SSM at recall) plus the lowest attention ratio (10%). The double-worst pick.
And IBM knew. From the official Granite documentation (ibm.com/granite/docs/models/granite), variants table:
Granite-4.0-Micro / 1B / 350M (Traditional, Dense): "Alternative option for users when Mamba2 support is not yet optimized (e.g. llama.cpp, PEFT, etc)"
IBM ships dense variants in the same family on purpose, because they know the hybrids break in current inference stacks. The phrasing talks about implementation optimization, but the 2024 Harvard result already predicted the problem independently of the implementation. I'd been running Granite-4-H as the main mAIstrow backbone for weeks. I had missed the note in the docs.
An agentic system is a chain of links
An agentic system is not a single LLM call. It's a chain of specialized calls passing results to one another. When the user says "summarize this Twitter thread and file it in my watch folder", a typical chain looks like:
- Intent understanding (the model picks which tool to call).
- Argument extraction (the URL of the thread, the folder name).
- An external tool call (`web_fetch` going out to fetch the thread).
- Compression / synthesis (the model summarizes what comes back).
- Format conformance (emit a structured note from a template).
- A final tool call to write the file.
Each of these links taps a different competence. Argument extraction needs verbatim copy. Synthesis needs reasoning and compression. Format conformance needs syntactic discipline. Tool calling needs precise knowledge of signatures and types.
These competences are not aligned with generic benchmarks. A model that dominates MMLU or GPQA can still fail at verbatim copy. Granite-4.1-30B posts honest reasoning scores and 24 to 88% error on copy. Conversely, a 2B model in Q4 that would look pale on GPQA is perfectly adequate to extract a numeric identifier from a thousand-token context.
The mistake I made, and that's very easy to make once you start industrializing, is taking one model and gluing it to every stage. Granite-4.1-30B is a solid generalist. It's competitive in reasoning, code and multilingual work. It collapses on copy. If argument extraction sits on the critical path of my system, I don't need a better prompt: I need a different model for that link.
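In practice that conclusion reduces to a routing table. A minimal sketch; the link labels and per-link model assignments here are hypothetical, loosely following the bench results above:

```python
# Hypothetical per-link routing: one backbone per competence, not one
# backbone for everything. Model names follow the bench in this article.
ROUTES = {
    "intent": "granite-4.1-30b",      # generalist reasoning: pick the tool
    "extract_args": "qwen3-vl-2b",    # verbatim copy: small dense Transformer
    "synthesize": "granite-4.1-30b",  # compression / summarization
    "format": "smollm3-3b",           # template conformance
}

DEFAULT = "granite-4.1-30b"

def model_for(link: str) -> str:
    """Pick the backbone for one link of the chain; fall back to the generalist."""
    return ROUTES.get(link, DEFAULT)

print(model_for("extract_args"))  # qwen3-vl-2b
```

The point isn't this particular table: it's that the routing decision is a dictionary lookup, while the cost of skipping it is a 24% error rate on the critical path.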
When this discipline pays off
Sectioning a system per link has a cost: maintain several models, load them, route them, keep benches up to date. When does it pay off?
For me the criterion is simple: does the system make several calls of the same shape, several times a day, on inputs that resemble each other? If yes, we're in the zone where discipline returns more than it costs. If not, "one big model for everything" is more than reasonable.
In practice:
- Prototype, POC, market test. Don't bother. Take Claude Sonnet or a Gemini Pro and move on. If the system doesn't survive three weeks, the sectioning effort won't have amortized.
- Personal assistant used every day, chaining tool calls. This is the zone. Identify the two or three most frequent links, bench them on the real task, pick the model per link. The gain compounds: fewer errors, fewer retries, less latency on the trivial pieces.
- Product making thousands of calls a day in industrial use. Non-negotiable. At that volume, the cost of a 30B model on ID extractions is wasted compute; the cost of a model that's wrong one time in four is wasted retries and perceived reliability.
The bench method itself is trivial: take the real task (not an approximation), run it on three or four candidate models, measure what you want to measure (success rate, latency, cost). Fifty lines of Python is enough. The trap is never technical: it's cognitive, it's believing you know before you've measured. On this copy story, ten minutes before running the bench I would have sworn quantization or the tokenizer was the cause. Both were wrong.
For the other links
For the other links of an agentic system, here are the architectures that dominate public benchmarks, as pointers (no demonstration here):
- Reasoning and math: models with a "think" mode (GPT-OSS, Gemma 4, Nemotron Nano v2). Watch: GPQA Diamond, AIME.
- Pure tool calling: specialized models like xLAM-8B or Hammer-7B, which often beat larger generalists on signature and type matching. Watch: Berkeley Function Calling Leaderboard (BFCL).
- Long context beyond 256K: Mamba/MoE hybrids like Nemotron 3 Nano (RULER 86.3% at 1M, the best public result so far). Watch: RULER, not the value of `max_position_embeddings`.
- GUI agents: UI-TARS-7B, beating Claude on ScreenSpot at 7B and under Apache 2.0.
I keep a more complete list of usable open-weight models (commercial license, under 200B parameters, released after April 2024) at github.com/xigh/open-weight-models. The full verbatim-copy bench, with sources and detailed methodology, is at github.com/xigh/llm-copy-bench. herbert-rs works on the other end of the chain: running these models locally, in Rust and assembly, so you don't have to depend on an API.
The bug I caught by clicking a 404 link is a 2024 theorem I could have known about earlier. One more reason for a serious agentic assistant not to lean on a single model for all its links.