Open-source LLMs in April 2026: landscape and observations
Changelog
2026-04-07: clarifications on NVIDIA's Nemotron lineup. Nemotron Nano 9B v2 is already covered in the reasoning section (/think mode), and Nemotron 3 Nano in the long context section (86.3% RULER at 1M). Adding here that the family also includes Nemotron 3 Super (120B MoE, multi-GPU server) and Llama-Nemotron-Super-49B (dense, derived from Llama 3.3, with the associated EU restrictions). More broadly, the open-weight-models repository contains a more comprehensive list than this article (theorem provers, GUI agents, search agents, tool calling, Rust specialists, etc.).
I'm building herbert-rs, a local LLM inference engine in Rust and hand-written assembly. To decide which models to support, I spent several weeks analyzing the open-source language models available today. Not all of them: only those that are commercially exploitable in Europe, under 200 billion total parameters, and released after April 2024.
This article is a snapshot. Models evolve fast. Benchmarks too. But the underlying trends move slower, and those are what I'm most interested in here.
Selection criteria
Three simple filters:
- Commercially exploitable license, no geographic restriction (EU ok)
- Size < 200B total parameters
- Less than 2 years old (released after April 2024)
This eliminates Llama 4 (EU exclusion), Qwen 3.6 Plus (closed-source), full DeepSeek V3/R1 (671B), and a few others. The reasons are detailed at the end of this article.
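The three filters are mechanical enough to express as code. A minimal sketch of how herbert-rs could apply them, assuming a hypothetical `Model` descriptor (the struct and its fields are illustrative, not a real registry API):

```rust
// Hypothetical model descriptor for this article's selection criteria.
// Fields are assumptions, not part of any real model registry.
struct Model {
    name: &'static str,
    total_params_b: f64,    // total parameters, in billions
    eu_commercial_ok: bool, // license allows commercial use in the EU
    release_year: u32,
    release_month: u32,
}

// The three filters: EU-exploitable license, < 200B total parameters,
// released after April 2024. Tuple comparison gives lexicographic
// (year, month) ordering for free.
fn passes_filters(m: &Model) -> bool {
    m.eu_commercial_ok
        && m.total_params_b < 200.0
        && (m.release_year, m.release_month) > (2024, 4)
}

fn main() {
    let candidates = [
        Model { name: "GPT-OSS-120B", total_params_b: 117.0,
                eu_commercial_ok: true, release_year: 2025, release_month: 8 },
        Model { name: "DeepSeek R1 (full)", total_params_b: 671.0,
                eu_commercial_ok: true, release_year: 2025, release_month: 1 },
    ];
    // Only GPT-OSS-120B survives: full R1 fails the 200B cutoff.
    for m in candidates.iter().filter(|m| passes_filters(m)) {
        println!("{}", m.name);
    }
}
```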
Generalists
Models that do a bit of everything: reasoning, code, instruction following, multilingual.
| Model | Publisher | Active | Total | Architecture | Ctx | License |
|---|---|---|---|---|---|---|
| Gemma 4 31B | Google | 31B | 31B | Dense | 256K | Apache 2.0 |
| Qwen3.5-27B | Alibaba | 27B | 27B | Dense | 128K | Apache 2.0 |
| Qwen3.5-9B | Alibaba | 9B | 9B | Dense | 128K | Apache 2.0 |
| Qwen3.5-122B-A10B | Alibaba | 10B | 122B | MoE | 256K | Apache 2.0 |
| GPT-OSS-120B | OpenAI | 5.1B | 117B | MoE | 128K | Apache 2.0 |
| GPT-OSS-20B | OpenAI | 3.6B | 21B | MoE | 128K | Apache 2.0 |
| Mistral Small 4 | Mistral | 6B | 119B | MoE | 256K | Apache 2.0 |
| GLM-4.5-Air | Zhipu AI | 12B | 106B | MoE | 128K | MIT |
| Llama 3.3 70B | Meta | 70B | 70B | Dense | 128K | Llama Community (EU OK) |
| InternVL3-78B | Shanghai AI Lab | 78B | 78B | Dense | -- | Apache 2.0 |
Reasoning (GPQA Diamond)
This benchmark is the most discriminating: 198 doctoral-level questions, impossible to solve by simple retrieval.
| Model | GPQA Diamond | Active params |
|---|---|---|
| Gemini 3.1 Pro (closed) | 94.3 | -- |
| GPT-5.4 (closed) | 92.8 | -- |
| Claude Opus 4.6 (closed) | 91.3 | -- |
| Gemma 4 31B | 84.3 | 31B |
| Qwen3.5-9B | 81.7 | 9B |
| GPT-OSS-120B | 80.9 | 5.1B |
| GLM-4.5-Air | 75.0 | 12B |
| Mistral Small 4 | 71.2 | 6B |
| Llama 3.3 70B | 50.5 | 70B |
Qwen3.5-9B at 81.7 with only 9 billion parameters. That's the most surprising number in this review.
Code
| Model | SWE-bench | Codeforces | Active | License |
|---|---|---|---|---|
| Claude Opus 4.6 (closed) | 80.8% | -- | -- | -- |
| Gemini 3.1 Pro (closed) | 80.6% | -- | -- | -- |
| GPT-5.4 (closed) | ~80% | -- | -- | -- |
| Step-3.5-Flash | 74.4% | -- | 11B | Apache 2.0 |
| Devstral Small 2 | 68.0% | -- | 24B | Apache 2.0 |
| GPT-OSS-120B | 62.4% | 2622 | 5.1B | Apache 2.0 |
| Gemma 4 31B | -- | 2150 | 31B | Apache 2.0 |
SWE-bench measures the ability to fix real bugs in existing codebases. Codeforces measures pure algorithmic skill. These are not the same: GPT-OSS-120B dominates in competition (ELO 2622) but gets beaten on real bugs by Step-3.5-Flash (74.4% vs 62.4%).
Specialized reasoning
| Model | Specialty | Key score | Active | License |
|---|---|---|---|---|
| QwQ-32B | Reasoning RL | AIME ~80% | 32B | Apache 2.0 |
| DeepSeek R1-Distill-32B | Distilled reasoning | beats o1-mini | 32B | MIT |
| Nemotron Nano 9B v2 | Math + /think control | MATH-500 97.8% | 9B | Nemotron OML |
Nemotron Nano 9B v2 has an interesting feature: /think and /no_think modes let you control the reasoning budget per request. An agent can think hard on a math problem and respond instantly to a simple question. This is a production feature, not a gimmick.
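From an agent's perspective, this looks like a per-request flag. A sketch of how herbert-rs might expose it; the enum, the function, and the assumption that the control string is prepended to the system prompt are all mine, not Nemotron's documented chat-template API:

```rust
// Per-request reasoning budget, as exposed by a hypothetical inference API.
enum Reasoning {
    Think,   // full reasoning trace before the answer
    NoThink, // answer immediately
}

// Assumption: the /think or /no_think control string is prepended to the
// system prompt. Check the model's actual chat template before relying on this.
fn system_prompt(base: &str, mode: Reasoning) -> String {
    let tag = match mode {
        Reasoning::Think => "/think",
        Reasoning::NoThink => "/no_think",
    };
    format!("{tag}\n{base}")
}

fn main() {
    // The same agent, two budgets: think hard on math, answer fast otherwise.
    let hard = system_prompt("You are a math assistant.", Reasoning::Think);
    let easy = system_prompt("You are a math assistant.", Reasoning::NoThink);
    println!("{hard}\n---\n{easy}");
}
```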
Compact models (< 8 GB)
For edge, mobile, or laptops with limited RAM.
| Model | Active | VRAM Q4 | Strength | License |
|---|---|---|---|---|
| SmolLM3-3B | 3B | ~2 GB | Best 3B, AIME 36.7%, /think mode, 64K ctx | Apache 2.0 |
| SmolLM2-1.7B | 1.7B | ~1 GB | 11T tokens, data-centric | Apache 2.0 |
| SmolLM2-135M | 135M | < 1 GB | Ultra-compact, few MB quantized | Apache 2.0 |
| Gemma 4 E2B | 2.3B | ~4 GB | Multimodal + audio | Apache 2.0 |
| Gemma 4 E4B | 4.5B | ~6 GB | Multimodal + audio | Apache 2.0 |
| Phi-4 | 3.8B-14B | 2-8 GB | Math, trimodal (5.6B) | MIT |
| Ministral 3B/8B/14B | 3-14B | 2-8 GB | Vision + reasoning | Apache 2.0 |
| LFM2.5-1.2B | 1.2B | ~1 GB | IFBench 47.3 (2x Qwen3-1.7B), thinking mode, vision, audio | LFM Open v1.0 |
| Llama 3.2 1B/3B | 1-3B | < 2 GB | 128K ctx, edge/mobile, EU OK (text-only) | Llama Community |
| InternLM3-8B | 8B | ~5 GB | Thinking mode, 4T tokens (75% less than competition) | Apache 2.0 |
| InternVL3-1B→38B | 1-38B | 1-20 GB | Vision SOTA, full range edge→server | Apache 2.0 |
Hugging Face's SmolLM3-3B beats all other 3B models and competes with 4B ones. SmolLM2's data-centric approach shows that data quality matters more than model size: the 1.7B trained on 11T tokens beats larger models trained on less data.
Ministral 14B at 85% on AIME 2025 for a 14B dense model is remarkable. And it fits in 8 GB quantized.
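The "VRAM Q4" column in the table above follows from simple arithmetic. A back-of-the-envelope sketch; the ~0.55 bytes/param figure (4-bit weights plus quantization scales) and the fixed overhead are ballpark assumptions, not measurements:

```rust
// Rough VRAM estimate for a 4-bit quantized model.
// 0.55 bytes/param approximates 4-bit weights plus group-wise scales;
// the overhead covers runtime buffers and a small KV cache. Both numbers
// are ballpark assumptions for illustration.
fn q4_vram_gb(params_b: f64) -> f64 {
    let bytes_per_param = 0.55;
    let overhead_gb = 0.5;
    params_b * bytes_per_param + overhead_gb
}

fn main() {
    for (name, p) in [("SmolLM3-3B", 3.0), ("Ministral 14B", 14.0)] {
        println!("{name}: ~{:.1} GB", q4_vram_gb(p));
    }
}
```

This reproduces the table's orders of magnitude: ~2 GB for a 3B model, ~8 GB for a 14B one.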
Long context and alternative architectures
| Model | Max ctx | RULER 1M | Architecture | Active | License |
|---|---|---|---|---|---|
| Nemotron 3 Nano | 1M | 86.3% | Mamba/MoE | 3.5B | Nemotron OML |
| Granite 4.0 | -- | -- | 90% Mamba-2 / 10% Attention | 3-9B | Apache 2.0 |
| LFM2/2.5 | 32K | -- | Convolutions + grouped attention | 2.3B | LFM Open v1.0 |
Nemotron 3 Nano is the long context champion: 86.3% on RULER at 1 million tokens, with only 3.5B active parameters. Mamba's linear complexity gives it a structural advantage over pure Transformers here.
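The structural advantage is easy to quantify: a Transformer's KV cache grows linearly with context, while a Mamba-style recurrent state is constant-size. A sketch of the arithmetic; the layer/head/dimension numbers are illustrative, not Nemotron 3 Nano's actual configuration:

```rust
// KV-cache size for a standard Transformer: 2 tensors (K and V) per layer,
// each [kv_heads x head_dim] per token, stored here in fp16 (2 bytes).
// Grows linearly with seq_len.
fn kv_cache_gb(layers: u64, kv_heads: u64, head_dim: u64, seq_len: u64) -> f64 {
    let bytes = 2 * layers * kv_heads * head_dim * seq_len * 2;
    bytes as f64 / 1e9
}

fn main() {
    // A mid-size dense Transformer (illustrative config) at 1M tokens:
    let gb = kv_cache_gb(32, 8, 128, 1_000_000);
    println!("KV cache at 1M tokens: ~{gb:.0} GB");
    // A Mamba layer instead carries a fixed-size recurrent state:
    // memory is constant in seq_len, which is the structural advantage.
}
```

Even with grouped-query attention, the illustrative config above needs on the order of 100 GB of cache at 1M tokens, more than the model weights themselves.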
But beware: many models advertise "1M context" without publishing a RULER score at that length. Without measurement, it's marketing.
Observations
What follows is not a list of definitive truths. These are patterns I observed while analyzing these models. They deserve to be verified over time.
Dense retreats above 35B, but doesn't die
For generalists above 35B, MoE (Mixture of Experts) clearly dominates: GPT-OSS-120B, Mistral Small 4, Qwen3.5-122B-A10B, GLM-4.5-Air, Step-3.5-Flash, Nemotron 3 Super... all MoE. The quality/compute ratio has become too favorable. But dense survives where it has a structural advantage: Llama 3.3 70B (generalist, MMLU 86.0), InternVL3-78B (vision, MMMU 72.2), Kimina-Prover-72B (theorem proving), Qwen 2.5-72B (production NLP), DeepSeek R1-Distill-70B (distilled reasoning). Dense is becoming a specialization choice, no longer the default.
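The "quality/compute ratio" claim can be put in numbers with the usual rule of thumb that decode FLOPs per token are roughly 2 × active parameters (ignoring attention and routing overhead, so a rough sketch, not a precise cost model):

```rust
// Rule of thumb: ~2 FLOPs per active parameter per decoded token.
// Ignores attention cost and MoE routing overhead; illustration only.
fn flops_per_token(active_params_b: f64) -> f64 {
    2.0 * active_params_b * 1e9
}

fn main() {
    let dense_70b = flops_per_token(70.0); // Llama 3.3 70B
    let moe_a10b = flops_per_token(10.0);  // Qwen3.5-122B-A10B (10B active)
    println!("dense 70B : {:.1e} FLOPs/token", dense_70b);
    println!("MoE A10B  : {:.1e} FLOPs/token ({}x cheaper)",
             moe_a10b, (dense_70b / moe_a10b) as u64);
}
```

By this estimate, a 122B-total MoE with 10B active decodes for roughly the cost of a 10B dense model, about 7x cheaper per token than a dense 70B. That ratio is why MoE dominates the large-generalist segment.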
Parameter count is no longer the determining factor
Qwen3.5-9B (9B) beats GPT-OSS-120B (5.1B active, 117B total) on GPQA Diamond. Architecture, training method (distillation + multi-agent RL), and data quality matter more than raw size.
Qwen has become the de facto base model
BFS-Prover (base Qwen2.5-32B), Goedel-Prover (base Qwen3-32B), Kimina-Prover (base Qwen2.5-72B), most community distillations: everything builds on Qwen. It's the equivalent of what ResNet was for transfer learning in vision a decade ago.
InternVL3 is the best open-source VLM nobody was talking about
InternVL3-78B (Shanghai AI Lab) reaches 72.2 on MMMU — on par with GPT-4o — under Apache 2.0. With a range from 1B to 78B, it's the direct competitor to Gemma 4 for multimodal. And InternLM3-8B proves you can reach SOTA with 75% fewer training tokens (4T instead of 15-18T). The lab gets less press than Alibaba, but the results speak.
The 40-79B segment is the dense survivors' refuge
New models often jump from ~35B straight to ~120B total via MoE. But the 40-79B range is still well populated by quality dense models: Llama 3.3 70B (Dec 2024), InternVL3-78B (Apr 2025), Kimina-Prover-72B (Apr 2025), Qwen 2.5-72B, R1-Distill-70B, Jamba 1.6 Mini 52B. This is where dense resists, and where you find both solid generalists and specialists (vision, theorem proving, math).
Step-3.5-Flash is the Swiss Army knife
It appears in 4 categories (code, generalist, agents, speed): SWE-bench 74.4%, 350 tok/s, and some of the best agent scores. If you could only deploy one model on a multi-GPU server, it's probably the most versatile.
GPT-OSS-120B has the best active-params-to-performance ratio
5.1B active parameters for a Codeforces ELO of 2622 and 96.6% on AIME. The most efficient model in the landscape for coding and math.
Licenses: don't overlook this
Most models listed here are under Apache 2.0: free commercial use, no geographic restriction, patent grant included, irrevocable license. Same license as TensorFlow or Kubernetes.
Notable exceptions:
| License | Models | Status |
|---|---|---|
| Apache 2.0 | Gemma 4, Qwen 3/3.5, GPT-OSS, Ministral, Step-3.5-Flash | Exploitable everywhere |
| MIT | GLM-4.5-Air, DeepSeek R1-Distill, Phi-4 | Exploitable everywhere |
| Nemotron OML | Nemotron 3 Nano/Super | Exploitable (custom, royalty-free, not OSI) |
| Llama Community | Llama 3.3 70B, Llama 3.2 1B/3B (text-only) | EU OK (700M MAU threshold) |
| LFM Open v1.0 | LFM2, LFM2.5 | Exploitable < $10M revenue |
Gemma 4 under Apache 2.0 is a turning point. Google previously used a restrictive custom license (Gemma Terms of Use). The switch to Apache 2.0 aligns Gemma with the rest of the open-source ecosystem.
Rejected models
- Llama 4 / Llama 3.2 Vision (Meta): the license explicitly excludes EU-domiciled entities for multimodal models. Text-only models (Llama 3.3 70B, Llama 3.2 1B/3B) are EU-exploitable.
- Qwen 3.6 Plus (Alibaba): closed-source, API-only. A step back from Qwen 3/3.5 which were Apache 2.0.
- Full DeepSeek V3/R1 (671B): above the 200B threshold.
How to choose
| Constraint | Recommendation |
|---|---|
| Smartphone / edge (< 4 GB) | Gemma 4 E2B, Phi-4-mini, Ministral 3B, LFM2.5-1.2B, Llama 3.2 1B/3B |
| Laptop 16 GB | GPT-OSS-20B, Ministral 14B, Gemma 4 26B-A4B |
| Desktop 24 GB | Gemma 4 31B, DeepSeek R1-Distill-32B, Devstral Small 2 |
| Desktop 48+ GB (dense 70B) | Llama 3.3 70B (MMLU 86.0, HumanEval 88.4, EU OK) |
| Server single-GPU | GPT-OSS-120B |
| Server multi-GPU | Step-3.5-Flash, Nemotron 3 Super, Qwen3.5-122B |
| Long context (> 256K) | Nemotron 3 Nano |
| Math | Nemotron Nano 9B v2 (with /think), GPT-OSS-120B |
| Code (real bugs) | Step-3.5-Flash, Devstral Small 2 |
| Multilingual (> 100 languages) | Qwen 3.5 (201 languages), Qwen 3 (119 languages) |
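For the general-purpose rows of the table above, the decision is mostly a VRAM cutoff. A minimal picker; the thresholds are my reading of the table, not official requirements, and the specialized rows (long context, math, code, multilingual) still need a manual choice:

```rust
// Map available VRAM (GB) to the table's general-purpose recommendation.
// Cutoffs are this article's reading of the hardware tiers, nothing official.
fn pick_general_model(vram_gb: u32) -> &'static str {
    match vram_gb {
        0..=3 => "Gemma 4 E2B / Ministral 3B / LFM2.5-1.2B", // edge / mobile
        4..=16 => "GPT-OSS-20B / Ministral 14B",             // laptop
        17..=47 => "Gemma 4 31B / DeepSeek R1-Distill-32B",  // desktop
        _ => "Llama 3.3 70B (dense) or GPT-OSS-120B",        // 48+ GB
    }
}

fn main() {
    for gb in [2, 16, 24, 80] {
        println!("{gb:>3} GB -> {}", pick_general_model(gb));
    }
}
```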
What's next
This overview covers text LLMs. More articles will follow on specialized models: embedding and retrieval, speech recognition, text-to-speech, image generation, theorem provers (Lean 4), and GUI agents.
The data in this article comes from a systematic review of over 60 models, with benchmark and license verification against primary sources (papers, HuggingFace, official repositories). The public reference for all 71 benchmarks (with paper, dataset and leaderboard links) is available at github.com/xigh/open-weight-models.
If you spot an error or a missing model, get in touch.