Open-source LLMs in April 2026: landscape and observations
Changelog
2026-04-07: clarifications on NVIDIA's Nemotron lineup. Nemotron Nano 9B v2 is already covered in the reasoning section (/think mode), and Nemotron 3 Nano in the long context section (86.3% RULER at 1M). Adding here that the family also includes Nemotron 3 Super (120B MoE, multi-GPU server) and Llama-Nemotron-Super-49B (dense, derived from Llama 3.3, with the associated EU restrictions). More broadly, the open-weight-models repository contains a more comprehensive list than this article (theorem provers, GUI agents, search agents, tool calling, Rust specialists, etc.).
I'm building herbert-rs, a local LLM inference engine in Rust and hand-written assembly. To decide which models to support, I spent several weeks analyzing the open-source language models available today. Not all of them: only those that are commercially exploitable in Europe, under 200 billion total parameters, and released after April 2024.
This article is a snapshot. Models evolve fast. Benchmarks too. But the underlying trends move slower, and those are what I'm most interested in here.
Selection criteria
Three simple filters:
- Commercially exploitable license, no geographic restriction (EU ok)
- Size < 200B total parameters
- Less than 2 years old (released after April 2024)
This eliminates Llama 4 (EU exclusion), Qwen 3.6 Plus (closed-source), full DeepSeek V3/R1 (671B), and a few others. The reasons are detailed at the end of this article.
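The three filters are mechanical enough to express as code. A minimal sketch of how herbert-rs could apply them, assuming a hypothetical `Model` descriptor (the struct and its fields are illustrative, not a real registry API):

```rust
// Hypothetical model descriptor for this article's selection criteria.
// Fields are assumptions, not part of any real model registry.
struct Model {
    name: &'static str,
    total_params_b: f64,    // total parameters, in billions
    eu_commercial_ok: bool, // license allows commercial use in the EU
    release_year: u32,
    release_month: u32,
}

// The three filters: EU-exploitable license, < 200B total parameters,
// released after April 2024. Tuple comparison gives lexicographic
// (year, month) ordering for free.
fn passes_filters(m: &Model) -> bool {
    m.eu_commercial_ok
        && m.total_params_b < 200.0
        && (m.release_year, m.release_month) > (2024, 4)
}

fn main() {
    let candidates = [
        Model { name: "GPT-OSS-120B", total_params_b: 117.0,
                eu_commercial_ok: true, release_year: 2025, release_month: 8 },
        Model { name: "DeepSeek R1 (full)", total_params_b: 671.0,
                eu_commercial_ok: true, release_year: 2025, release_month: 1 },
    ];
    // Only GPT-OSS-120B survives: full R1 fails the 200B cutoff.
    for m in candidates.iter().filter(|m| passes_filters(m)) {
        println!("{}", m.name);
    }
}
```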
Generalists
Models that do a bit of everything: reasoning, code, instruction following, multilingual.
| Model | Publisher | Active | Total | Architecture | Ctx | License |
|---|---|---|---|---|---|---|
| Gemma 4 31B | Google | 31B | 31B | Dense | 256K | Apache 2.0 |
| Qwen3.5-27B | Alibaba | 27B | 27B | Dense | 128K | Apache 2.0 |
| Qwen3.5-9B | Alibaba | 9B | 9B | Dense | 128K | Apache 2.0 |
| Qwen3.5-122B-A10B | Alibaba | 10B | 122B | MoE | 256K | Apache 2.0 |
| GPT-OSS-120B | OpenAI | 5.1B | 117B | MoE | 128K | Apache 2.0 |
| GPT-OSS-20B | OpenAI | 3.6B | 21B | MoE | 128K | Apache 2.0 |
| Mistral Small 4 | Mistral | 6B | 119B | MoE | 256K | Apache 2.0 |
| GLM-4.5-Air | Zhipu AI | 12B | 106B | MoE | 128K | MIT |
| Llama 3.3 70B | Meta | 70B | 70B | Dense | 128K | Llama Community (EU OK) |
| InternVL3-78B | Shanghai AI Lab | 78B | 78B | Dense | -- | Apache 2.0 |
Reasoning (GPQA Diamond)
This benchmark is the most discriminating: 198 doctoral-level questions, impossible to solve by simple retrieval.
| Model | GPQA Diamond | Active params |
|---|---|---|
| Gemini 3.1 Pro (closed) | 94.3 | -- |
| GPT-5.4 (closed) | 92.8 | -- |
| Claude Opus 4.6 (closed) | 91.3 | -- |
| Gemma 4 31B | 84.3 | 31B |
| Qwen3.5-9B | 81.7 | 9B |
| GPT-OSS-120B | 80.9 | 5.1B |
| GLM-4.5-Air | 75.0 | 12B |
| Mistral Small 4 | 71.2 | 6B |
| Llama 3.3 70B | 50.5 | 70B |
Qwen3.5-9B at 81.7 with only 9 billion parameters. That's the most surprising number in this review.
Code
| Model | SWE-bench | Codeforces | Active | License |
|---|---|---|---|---|
| Claude Opus 4.6 (closed) | 80.8% | -- | -- | -- |
| Gemini 3.1 Pro (closed) | 80.6% | -- | -- | -- |
| GPT-5.4 (closed) | ~80% | -- | -- | -- |
| Step-3.5-Flash | 74.4% | -- | 11B | Apache 2.0 |
| Devstral Small 2 | 68.0% | -- | 24B | Apache 2.0 |
| GPT-OSS-120B | 62.4% | 2622 | 5.1B | Apache 2.0 |
| Gemma 4 31B | -- | 2150 | 31B | Apache 2.0 |
SWE-bench measures the ability to fix real bugs in existing codebases. Codeforces measures pure algorithmic skill. These are not the same: GPT-OSS-120B dominates in competition (ELO 2622) but gets beaten on real bugs by Step-3.5-Flash (74.4% vs 62.4%).
Specialized reasoning
| Model | Specialty | Key score | Active | License |
|---|---|---|---|---|
| QwQ-32B | Reasoning RL | AIME ~80% | 32B | Apache 2.0 |
| DeepSeek R1-Distill-32B | Distilled reasoning | beats o1-mini | 32B | MIT |
| Nemotron Nano 9B v2 | Math + /think control | MATH-500 97.8% | 9B | Nemotron OML |
Nemotron Nano 9B v2 has an interesting feature: /think and /no_think modes let you control the reasoning budget per request. An agent can think hard on a math problem and respond instantly to a simple question. This is a production feature, not a gimmick.
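From an agent's perspective, this looks like a per-request flag. A sketch of how herbert-rs might expose it; the enum, the function, and the assumption that the control string is prepended to the system prompt are all mine, not Nemotron's documented chat-template API:

```rust
// Per-request reasoning budget, as exposed by a hypothetical inference API.
enum Reasoning {
    Think,   // full reasoning trace before the answer
    NoThink, // answer immediately
}

// Assumption: the /think or /no_think control string is prepended to the
// system prompt. Check the model's actual chat template before relying on this.
fn system_prompt(base: &str, mode: Reasoning) -> String {
    let tag = match mode {
        Reasoning::Think => "/think",
        Reasoning::NoThink => "/no_think",
    };
    format!("{tag}\n{base}")
}

fn main() {
    // The same agent, two budgets: think hard on math, answer fast otherwise.
    let hard = system_prompt("You are a math assistant.", Reasoning::Think);
    let easy = system_prompt("You are a math assistant.", Reasoning::NoThink);
    println!("{hard}\n---\n{easy}");
}
```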
Compact models (< 8 GB)
For edge, mobile, or laptops with limited RAM.
| Model | Active | VRAM Q4 | Strength | License |
|---|---|---|---|---|
| SmolLM3-3B | 3B | ~2 GB | Best 3B, AIME 36.7%, /think mode, 64K ctx | Apache 2.0 |
| SmolLM2-1.7B | 1.7B | ~1 GB | 11T tokens, data-centric | Apache 2.0 |
| SmolLM2-135M | 135M | < 1 GB | Ultra-compact, few MB quantized | Apache 2.0 |
| Gemma 4 E2B | 2.3B | ~4 GB | Multimodal + audio | Apache 2.0 |
| Gemma 4 E4B | 4.5B | ~6 GB | Multimodal + audio | Apache 2.0 |
| Phi-4 | 3.8B-14B | 2-8 GB | Math, trimodal (5.6B) | MIT |
| Ministral 3B/8B/14B | 3-14B | 2-8 GB | Vision + reasoning | Apache 2.0 |
| LFM2.5-1.2B | 1.2B | ~1 GB | IFBench 47.3 (2x Qwen3-1.7B), thinking mode, vision, audio | LFM Open v1.0 |
| Llama 3.2 1B/3B | 1-3B | < 2 GB | 128K ctx, edge/mobile, EU OK (text-only) | Llama Community |
| InternLM3-8B | 8B | ~5 GB | Thinking mode, 4T tokens (75% less than competition) | Apache 2.0 |
| InternVL3-1B→38B | 1-38B | 1-20 GB | Vision SOTA, full range edge→server | Apache 2.0 |
Hugging Face's SmolLM3-3B beats all other 3B models and competes with 4B ones. SmolLM2's data-centric approach shows that data quality matters more than model size: the 1.7B trained on 11T tokens beats larger models trained on less data.
Ministral 14B at 85% on AIME 2025 for a 14B dense model is remarkable. And it fits in 8 GB quantized.
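The "VRAM Q4" column in the table above follows from simple arithmetic. A back-of-the-envelope sketch; the ~0.55 bytes/param figure (4-bit weights plus quantization scales) and the fixed overhead are ballpark assumptions, not measurements:

```rust
// Rough VRAM estimate for a 4-bit quantized model.
// 0.55 bytes/param approximates 4-bit weights plus group-wise scales;
// the overhead covers runtime buffers and a small KV cache. Both numbers
// are ballpark assumptions for illustration.
fn q4_vram_gb(params_b: f64) -> f64 {
    let bytes_per_param = 0.55;
    let overhead_gb = 0.5;
    params_b * bytes_per_param + overhead_gb
}

fn main() {
    for (name, p) in [("SmolLM3-3B", 3.0), ("Ministral 14B", 14.0)] {
        println!("{name}: ~{:.1} GB", q4_vram_gb(p));
    }
}
```

This reproduces the table's orders of magnitude: ~2 GB for a 3B model, ~8 GB for a 14B one.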
Long context and alternative architectures
| Model | Max ctx | RULER 1M | Architecture | Active | License |
|---|---|---|---|---|---|
| Nemotron 3 Nano | 1M | 86.3% | Mamba/MoE | 3.5B | Nemotron OML |
| Granite 4.0 | -- | -- | 90% Mamba-2 / 10% Attention | 3-9B | Apache 2.0 |
| LFM2/2.5 | 32K | -- | Convolutions + grouped attention | 2.3B | LFM Open v1.0 |
Nemotron 3 Nano is the long context champion: 86.3% on RULER at 1 million tokens, with only 3.5B active parameters. Mamba's linear complexity gives it a structural advantage over pure Transformers here.
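The structural advantage is easy to quantify: a Transformer's KV cache grows linearly with context, while a Mamba-style recurrent state is constant-size. A sketch of the arithmetic; the layer/head/dimension numbers are illustrative, not Nemotron 3 Nano's actual configuration:

```rust
// KV-cache size for a standard Transformer: 2 tensors (K and V) per layer,
// each [kv_heads x head_dim] per token, stored here in fp16 (2 bytes).
// Grows linearly with seq_len.
fn kv_cache_gb(layers: u64, kv_heads: u64, head_dim: u64, seq_len: u64) -> f64 {
    let bytes = 2 * layers * kv_heads * head_dim * seq_len * 2;
    bytes as f64 / 1e9
}

fn main() {
    // A mid-size dense Transformer (illustrative config) at 1M tokens:
    let gb = kv_cache_gb(32, 8, 128, 1_000_000);
    println!("KV cache at 1M tokens: ~{gb:.0} GB");
    // A Mamba layer instead carries a fixed-size recurrent state:
    // memory is constant in seq_len, which is the structural advantage.
}
```

Even with grouped-query attention, the illustrative config above needs on the order of 100 GB of cache at 1M tokens, more than the model weights themselves.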
But beware: many models advertise "1M context" without publishing a RULER score at that length. Without measurement, it's marketing.
Observations
What follows is not a list of definitive truths. These are patterns I observed while analyzing these models. They deserve to be verified over time.
Dense retreats above 35B, but doesn't die
For generalists above 35B, MoE (Mixture of Experts) clearly dominates: GPT-OSS-120B, Mistral Small 4, Qwen3.5-122B-A10B, GLM-4.5-Air, Step-3.5-Flash, Nemotron 3 Super... all MoE. The quality/compute ratio has become too favorable. But dense survives where it has a structural advantage: Llama 3.3 70B (generalist, MMLU 86.0), InternVL3-78B (vision, MMMU 72.2), Kimina-Prover-72B (theorem proving), Qwen 2.5-72B (production NLP), DeepSeek R1-Distill-70B (distilled reasoning). Dense is becoming a specialization choice, no longer the default.
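The "quality/compute ratio" claim can be put in numbers with the usual rule of thumb that decode FLOPs per token are roughly 2 × active parameters (ignoring attention and routing overhead, so a rough sketch, not a precise cost model):

```rust
// Rule of thumb: ~2 FLOPs per active parameter per decoded token.
// Ignores attention cost and MoE routing overhead; illustration only.
fn flops_per_token(active_params_b: f64) -> f64 {
    2.0 * active_params_b * 1e9
}

fn main() {
    let dense_70b = flops_per_token(70.0); // Llama 3.3 70B
    let moe_a10b = flops_per_token(10.0);  // Qwen3.5-122B-A10B (10B active)
    println!("dense 70B : {:.1e} FLOPs/token", dense_70b);
    println!("MoE A10B  : {:.1e} FLOPs/token ({}x cheaper)",
             moe_a10b, (dense_70b / moe_a10b) as u64);
}
```

By this estimate, a 122B-total MoE with 10B active decodes for roughly the cost of a 10B dense model, about 7x cheaper per token than a dense 70B. That ratio is why MoE dominates the large-generalist segment.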
Parameter count is no longer the determining factor
Qwen3.5-9B (9B) beats GPT-OSS-120B (5.1B active, 117B total) on GPQA Diamond. Architecture, training method (distillation + multi-agent RL), and data quality matter more than raw size.
Qwen has become the de facto base model
BFS-Prover (base Qwen2.5-32B), Goedel-Prover (base Qwen3-32B), Kimina-Prover (base Qwen2.5-72B), most community distillations: everything builds on Qwen. It's the equivalent of what ResNet was for transfer learning in vision a decade ago.
InternVL3 is the best open-source VLM nobody was talking about
InternVL3-78B (Shanghai AI Lab) reaches 72.2 on MMMU — on par with GPT-4o — under Apache 2.0. With a range from 1B to 78B, it's the direct competitor to Gemma 4 for multimodal. And InternLM3-8B proves you can reach SOTA with 75% fewer training tokens (4T instead of 15-18T). The lab gets less press than Alibaba, but the results speak.
The 40-79B segment is the dense survivors' refuge
New models often jump from ~35B straight to ~120B total via MoE. But the 40-79B range is still well populated by quality dense models: Llama 3.3 70B (Dec 2024), InternVL3-78B (Apr 2025), Kimina-Prover-72B (Apr 2025), Qwen 2.5-72B, R1-Distill-70B, Jamba 1.6 Mini 52B. This is where dense resists, and where you find both solid generalists and specialists (vision, theorem proving, math).
Step-3.5-Flash is the Swiss Army knife
It appears in 4 categories (code, generalist, agents, speed): SWE-bench 74.4%, 350 tok/s, and some of the best agent scores. If you could only deploy one model on a multi-GPU server, it's probably the most versatile.
GPT-OSS-120B has the best active-params-to-performance ratio
5.1B active parameters for a Codeforces ELO of 2622 and 96.6% on AIME. The most efficient model in the landscape for coding and math.
Licenses: don't overlook this
Most models listed here are under Apache 2.0: free commercial use, no geographic restriction, patent grant included, irrevocable license. Same license as TensorFlow or Kubernetes.
Notable exceptions:
| License | Models | Status |
|---|---|---|
| Apache 2.0 | Gemma 4, Qwen 3/3.5, GPT-OSS, Ministral, Step-3.5-Flash | Exploitable everywhere |
| MIT | GLM-4.5-Air, DeepSeek R1-Distill, Phi-4 | Exploitable everywhere |
| Nemotron OML | Nemotron 3 Nano/Super | Exploitable (custom, royalty-free, not OSI) |
| Llama Community | Llama 3.3 70B, Llama 3.2 1B/3B (text-only) | EU OK (700M MAU threshold) |
| LFM Open v1.0 | LFM2, LFM2.5 | Exploitable < $10M revenue |
Gemma 4 under Apache 2.0 is a turning point. Google previously used a restrictive custom license (Gemma Terms of Use). The switch to Apache 2.0 aligns Gemma with the rest of the open-source ecosystem.
Rejected models
- Llama 4 / Llama 3.2 Vision (Meta): the license explicitly excludes EU-domiciled entities for multimodal models. Text-only models (Llama 3.3 70B, Llama 3.2 1B/3B) are EU-exploitable.
- Qwen 3.6 Plus (Alibaba): closed-source, API-only. A step back from Qwen 3/3.5 which were Apache 2.0.
- Full DeepSeek V3/R1 (671B): above the 200B threshold.
How to choose
| Constraint | Recommendation |
|---|---|
| Smartphone / edge (< 4 GB) | Gemma 4 E2B, Phi-4-mini, Ministral 3B, LFM2.5-1.2B, Llama 3.2 1B/3B |
| Laptop 16 GB | GPT-OSS-20B, Ministral 14B, Gemma 4 26B-A4B |
| Desktop 24 GB | Gemma 4 31B, DeepSeek R1-Distill-32B, Devstral Small 2 |
| Desktop 48+ GB (dense 70B) | Llama 3.3 70B (MMLU 86.0, HumanEval 88.4, EU OK) |
| Server single-GPU | GPT-OSS-120B |
| Server multi-GPU | Step-3.5-Flash, Nemotron 3 Super, Qwen3.5-122B |
| Long context (> 256K) | Nemotron 3 Nano |
| Math | Nemotron Nano 9B v2 (with /think), GPT-OSS-120B |
| Code (real bugs) | Step-3.5-Flash, Devstral Small 2 |
| Multilingual (> 100 languages) | Qwen 3.5 (201 languages), Qwen 3 (119 languages) |
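For the general-purpose rows of the table above, the decision is mostly a VRAM cutoff. A minimal picker; the thresholds are my reading of the table, not official requirements, and the specialized rows (long context, math, code, multilingual) still need a manual choice:

```rust
// Map available VRAM (GB) to the table's general-purpose recommendation.
// Cutoffs are this article's reading of the hardware tiers, nothing official.
fn pick_general_model(vram_gb: u32) -> &'static str {
    match vram_gb {
        0..=3 => "Gemma 4 E2B / Ministral 3B / LFM2.5-1.2B", // edge / mobile
        4..=16 => "GPT-OSS-20B / Ministral 14B",             // laptop
        17..=47 => "Gemma 4 31B / DeepSeek R1-Distill-32B",  // desktop
        _ => "Llama 3.3 70B (dense) or GPT-OSS-120B",        // 48+ GB
    }
}

fn main() {
    for gb in [2, 16, 24, 80] {
        println!("{gb:>3} GB -> {}", pick_general_model(gb));
    }
}
```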
What's next
This overview covers text LLMs. More articles will follow on specialized models: embedding and retrieval, speech recognition, text-to-speech, image generation, theorem provers (Lean 4), and GUI agents.
The data in this article comes from a systematic review of over 60 models, with benchmark and license verification against primary sources (papers, HuggingFace, official repositories). The public reference for all 71 benchmarks (with paper, dataset and leaderboard links) is available at github.com/xigh/open-weight-models.
If you spot an error or a missing model, get in touch.