Qwen3 vs Qwen3.5: What Actually Changes
Qwen3.5 is not a simple scale-up of Qwen3. The architecture changes fundamentally. Here is what matters when you are writing an inference engine.
Hybrid attention: the real change
Qwen3 uses standard softmax attention in every layer. Qwen3.5 introduces a hybrid design: 75% of layers use a new mechanism called Gated DeltaNet.
The pattern: out of 4 consecutive layers, 3 use DeltaNet, 1 uses standard attention. This block repeats 12 or 15 times depending on the model.
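For an engine, this is just a per-layer type table. A minimal sketch in Python, using the 48-layer (12-block) configuration; placing the full-attention layer last in each block of 4 is an assumption about the exact layout:

```python
# Hypothetical layer schedule for a 48-layer model (12 blocks of 4 layers).
# Assumption: the full-attention layer sits last in each block.
NUM_LAYERS = 48

layer_types = [
    "full_attention" if (i % 4) == 3 else "gated_deltanet"
    for i in range(NUM_LAYERS)
]

assert layer_types.count("gated_deltanet") == 36   # 75% of layers
assert layer_types.count("full_attention") == 12   # 25% of layers
```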
Gated DeltaNet: how it works
Standard attention stores all past K and V vectors in a cache that grows linearly with sequence length. At 32K tokens, this cache dominates memory and bandwidth.
Gated DeltaNet replaces this cache with a compressed state matrix S, fixed in size, updated at each token with a delta rule:
S(t) = beta(t) . S(t-1) + alpha(t) . k(t) . v(t)^T
- beta(t) is a forget gate: how much of the previous state to keep
- alpha(t) is an input gate: how much of the new token to integrate
- Both are learned functions of the input, not constants
The output is simply: o(t) = S(t)^T . q(t)
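Taken at face value, the recurrence is a few lines of code per decode step. A minimal sketch for a single head, directly transcribing the simplified formulas above (shapes and gate values are illustrative; production kernels also fuse normalization and process all heads at once):

```python
import numpy as np

def deltanet_decode_step(S, q, k, v, alpha, beta):
    """One decode step for a single head, transcribing the formulas above.

    S     : (d_k, d_v) compressed state matrix, carried across tokens
    q, k  : (d_k,) query / key for the current token
    v     : (d_v,) value for the current token
    alpha : scalar input gate in [0, 1]
    beta  : scalar forget gate in [0, 1]
    """
    S = beta * S + alpha * np.outer(k, v)   # S(t) = beta(t) S(t-1) + alpha(t) k(t) v(t)^T
    o = S.T @ q                             # o(t) = S(t)^T q(t)
    return S, o

# Toy usage with illustrative dimensions.
d_k, d_v = 128, 128
S = np.zeros((d_k, d_v), dtype=np.float32)
q, k = np.random.randn(d_k), np.random.randn(d_k)
v = np.random.randn(d_v)
S, o = deltanet_decode_step(S, q, k, v, alpha=0.9, beta=0.95)
```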
Concretely, for Qwen3-Next-80B, the state S has shape 16 x 128 x 32 x 128, about 32 MB per layer in f32. This is fixed: whether the sequence is 1K or 256K tokens, the size does not change. Compare this with a standard KV cache that grows to several GB on long contexts.
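The arithmetic behind that comparison is easy to check. The state size below uses the shape quoted above; the KV cache formula is generic, and the layer/head numbers plugged into it are placeholders for illustration, not Qwen3.5 values:

```python
# Fixed DeltaNet state per layer, from the shape quoted above.
state_bytes = 16 * 128 * 32 * 128 * 4                 # f32
print(state_bytes / 2**20)                            # 32.0 MB, independent of context length

# A standard KV cache grows linearly with sequence length.
# Head counts and dims below are placeholders for illustration only.
def kv_cache_bytes(seq_len, n_layers=12, n_kv_heads=8, head_dim=256, dtype_bytes=2):
    return seq_len * n_layers * n_kv_heads * head_dim * 2 * dtype_bytes   # K and V

print(kv_cache_bytes(32_768) / 2**30)                 # already ~3 GB at 32K tokens
```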
Causal Conv1D
Before the DeltaNet computation, the input goes through a causal 1D convolution with kernel size 4. This mechanism (similar to Mamba's) gives each position a local context of 4 tokens to compute the alpha and beta gates. This is what allows the recurrence to "know" when to forget and when to retain.
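A minimal sketch of a depthwise causal conv with kernel size 4; the weight layout and the absence of bias/activation handling are simplifications, not the checkpoint's exact format:

```python
import numpy as np

def causal_conv1d(x, w):
    """Depthwise causal 1D convolution: output at position t sees only t-3 .. t.

    x : (seq_len, channels) input activations
    w : (kernel_size, channels) per-channel filter, kernel_size = 4 here
    """
    kernel_size = w.shape[0]
    # Left-pad with zeros so no position can look at future tokens.
    pad = np.zeros((kernel_size - 1, x.shape[1]), dtype=x.dtype)
    x_pad = np.concatenate([pad, x], axis=0)
    out = np.empty_like(x)
    for t in range(x.shape[0]):
        out[t] = (x_pad[t : t + kernel_size] * w).sum(axis=0)
    return out

x = np.random.randn(10, 8).astype(np.float32)   # 10 tokens, 8 channels
w = np.random.randn(4, 8).astype(np.float32)    # kernel size 4
y = causal_conv1d(x, w)
```

During decode this reduces to keeping a rolling buffer of the last 3 inputs per channel, much like the conv state in Mamba-style implementations.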
Parallel prefill
The recurrent formula seems to impose sequential processing, but in practice the prefill can be parallelized with a parallel scan (same technique as Mamba-2). During token-by-token decode, it is a simple update of S in O(d_k . d_v) per layer -- much faster than attending over the entire sequence.
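To see why prefill parallelizes, unroll the recurrence over a chunk: the state at the end of the chunk is the old state scaled by the product of the forget gates, plus a single weighted K^T V contraction. A minimal chunkwise sketch under the simplified formula above (real kernels use fused chunkwise scans, as in Mamba-2; this only shows the algebra):

```python
import numpy as np

def chunk_prefill_state(S0, K, V, alpha, beta):
    """Advance the DeltaNet state over a whole chunk in one shot.

    Unrolling S(t) = beta(t) S(t-1) + alpha(t) k(t) v(t)^T over a chunk of
    length C gives
        S_end = (prod_t beta_t) * S0 + sum_j (prod_{i>j} beta_i) * alpha_j * k_j v_j^T,
    i.e. one weighted K^T @ V contraction instead of C sequential updates.

    S0 : (d_k, d_v) state entering the chunk
    K  : (C, d_k),  V : (C, d_v),  alpha, beta : (C,)
    """
    rev_cumprod = np.cumprod(beta[::-1])[::-1]        # prod_{i>=j} beta_i
    decay = np.concatenate([rev_cumprod[1:], [1.0]])  # prod_{i>j}  beta_i
    weights = (alpha * decay)[:, None]                # (C, 1)
    return rev_cumprod[0] * S0 + (K * weights).T @ V

# Sanity check against the sequential recurrence.
d_k, d_v, C = 16, 16, 8
K, V = np.random.randn(C, d_k), np.random.randn(C, d_v)
alpha, beta = np.random.rand(C), np.random.rand(C)
S_seq = np.zeros((d_k, d_v))
for t in range(C):
    S_seq = beta[t] * S_seq + alpha[t] * np.outer(K[t], V[t])
assert np.allclose(S_seq, chunk_prefill_state(np.zeros((d_k, d_v)), K, V, alpha, beta))
```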
Why not pure linear?
Pure linear attention loses precise token-to-token lookup (the "needle in a haystack" problem). The state matrix S compresses information, so it can miss a specific token buried in a long context.
Solution: the full attention layer in every block of 4 serves as an exact retrieval checkpoint. DeltaNet handles efficient long-range propagation; standard attention corrects precisely when needed.
Claimed result: 8.6x faster decoding at 32K tokens, 19x at 256K, 1M+ token context with constant memory.
MoE: more experts, less activation
| | Qwen3-30B | Qwen3.5-397B |
|---|---|---|
| Total experts | 128 | 512 |
| Active experts/token | 8 | 10 |
| Expert size | 768 | 1024 |
| Shared expert | No | Yes |
| Activation ratio | 6.25% | 1.95% |
4x more experts, but the activation ratio drops to 1.95%. Each expert is more finely specialized. A shared expert (always active) guarantees baseline FFN capacity regardless of routing.
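Engine-side, the shared expert is just one more FFN applied unconditionally on top of the routed top-k. A minimal sketch with the counts from the table; the exact router normalization (softmax over the selected experts) is an assumption:

```python
import numpy as np

def moe_forward(x, router_w, experts, shared_expert, top_k=10):
    """One token through a MoE block with a shared expert.

    x             : (d_model,) token activation
    router_w      : (n_experts, d_model) router weights
    experts       : list of n_experts callables, each (d_model,) -> (d_model,)
    shared_expert : always-active callable, (d_model,) -> (d_model,)
    """
    logits = router_w @ x                          # (n_experts,)
    top = np.argsort(logits)[-top_k:]              # indices of the top-k experts
    # Assumption: softmax renormalized over the selected experts only.
    w = np.exp(logits[top] - logits[top].max())
    w /= w.sum()
    out = shared_expert(x)                         # shared expert is always active
    for weight, idx in zip(w, top):
        out = out + weight * experts[idx](x)
    return out

# Toy usage: 512 experts, 10 active, linear "experts" just for shape checking.
d = 64
experts = [lambda x, W=np.random.randn(d, d) * 0.01: W @ x for _ in range(512)]
shared = lambda x, W=np.random.randn(d, d) * 0.01: W @ x
y = moe_forward(np.random.randn(d), np.random.randn(512, d), experts, shared)
```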
Partial RoPE
In Qwen3, RoPE applies to all 128 head dimensions (64 rotary pairs). In Qwen3.5, the head dim grows to 256 and only 25% of it is rotary-encoded: the rotary dims stay at 64, while the remaining 192 dims are position-free. Same partial-rotary idea as GPT-NeoX / Phi.
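For the attention layers, this only changes how the rotary embedding is applied: rotate the first 64 dims of each 256-dim head and pass the other 192 through untouched. A minimal sketch (which 64 dims get rotated, and the rotate-half pairing, are assumptions about the convention):

```python
import numpy as np

def partial_rope(x, pos, rotary_dim=64, base=10000.0):
    """Apply RoPE to the first `rotary_dim` dims of a head; leave the rest as-is.

    x   : (head_dim,) query or key vector for one position, head_dim = 256 here
    pos : integer token position
    """
    rot, rest = x[:rotary_dim], x[rotary_dim:]
    half = rotary_dim // 2
    inv_freq = base ** (-np.arange(half) / half)      # (rotary_dim/2,) frequencies
    angles = pos * inv_freq
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = rot[:half], rot[half:]                   # "rotate-half" pairing convention
    rotated = np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos])
    return np.concatenate([rotated, rest])            # 192 dims pass through unchanged

q = np.random.randn(256).astype(np.float32)
q_rot = partial_rope(q, pos=17)
assert np.allclose(q_rot[64:], q[64:])                # only the first 64 dims change
```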
Engine impact
The only truly new primitive is the Gated DeltaNet layer. Everything else is scaling or configuration.
What is trivial: larger vocabulary, attention output gate, YaRN scaling.
What takes work: recurrent state per layer, parallel scan for prefill, causal Conv1D (kernel=4), gating logic. The full attention layers (1 out of 4) are what we already have, with wider heads and partial RoPE.
The bottom line: the KV cache, the main memory wall in long-context inference, disappears for 75% of layers. This is a fundamental change for local inference.