Open-source LLMs in April 2026: landscape and observations

Changelog

2026-04-07: clarifications on NVIDIA's Nemotron lineup. Nemotron Nano 9B v2 is already covered in the reasoning section (/think mode), and Nemotron 3 Nano in the long context section (86.3% RULER at 1M). Adding here that the family also includes Nemotron 3 Super (120B MoE, multi-GPU server) and Llama-Nemotron-Super-49B (dense, derived from Llama 3.3 with the associated EU restrictions). More broadly, the open-weight-models repository contains a more comprehensive list than this article (theorem provers, GUI agents, search agents, tool calling, Rust specialists, etc.).

I'm building herbert-rs, a local LLM inference engine in Rust and hand-written assembly. To decide which models to support, I spent several weeks analyzing the open-source language models available today. Not all of them: only those that are commercially exploitable in Europe, under 200 billion total parameters, and released after April 2024.

This article is a snapshot. Models evolve fast. Benchmarks too. But the underlying trends move slower, and those are what I'm most interested in here.

Selection criteria

Three simple filters:

  1. Commercially exploitable license, no geographic restriction (EU OK)
  2. Size < 200B total parameters
  3. Less than 2 years old (released after April 2024)

This eliminates Llama 4 (EU exclusion), Qwen 3.6 Plus (closed-source), full DeepSeek V3/R1 (671B), and a few others. The reasons are detailed at the end of this article.
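The three filters are mechanical enough to express as a predicate. Here is a minimal sketch in Rust; the struct fields and the example entries are illustrative, not a real catalog API.

```rust
// Sketch of the selection filter; field names are illustrative.
struct Model {
    name: &'static str,
    total_params_b: f32,      // total parameters, in billions
    release_year: u32,
    release_month: u32,
    commercial_license: bool, // license allows commercial use
    eu_restricted: bool,      // geographic restriction excluding the EU
}

// True if the model passes all three filters: commercially usable in the EU,
// under 200B total parameters, released in or after April 2024.
fn passes_filters(m: &Model) -> bool {
    let recent = m.release_year > 2024
        || (m.release_year == 2024 && m.release_month >= 4);
    m.commercial_license && !m.eu_restricted && m.total_params_b < 200.0 && recent
}

fn main() {
    let kept = Model {
        name: "GLM-4.5-Air",
        total_params_b: 106.0,
        release_year: 2025,
        release_month: 7,
        commercial_license: true,
        eu_restricted: false,
    };
    let rejected = Model {
        name: "Llama 4",
        total_params_b: 109.0,
        release_year: 2025,
        release_month: 4,
        commercial_license: true,
        eu_restricted: true, // EU exclusion in the license text
    };
    assert!(passes_filters(&kept));
    assert!(!passes_filters(&rejected));
    println!("filters ok");
}
```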


Generalists

Models that do a bit of everything: reasoning, code, instruction following, multilingual.

| Model | Publisher | Active | Total | Architecture | Ctx | License |
|---|---|---|---|---|---|---|
| Gemma 4 31B | Google | 31B | 31B | Dense | 256K | Apache 2.0 |
| Qwen3.5-27B | Alibaba | 27B | 27B | Dense | 128K | Apache 2.0 |
| Qwen3.5-9B | Alibaba | 9B | 9B | Dense | 128K | Apache 2.0 |
| Qwen3.5-122B-A10B | Alibaba | 10B | 122B | MoE | 256K | Apache 2.0 |
| GPT-OSS-120B | OpenAI | 5.1B | 117B | MoE | 128K | Apache 2.0 |
| GPT-OSS-20B | OpenAI | 3.6B | 21B | MoE | 128K | Apache 2.0 |
| Mistral Small 4 | Mistral | 6B | 119B | MoE | 256K | Apache 2.0 |
| GLM-4.5-Air | Zhipu AI | 12B | 106B | MoE | 128K | MIT |
| Llama 3.3 70B | Meta | 70B | 70B | Dense | 128K | Llama Community (EU OK) |
| InternVL3-78B | Shanghai AI Lab | 78B | 78B | Dense | -- | Apache 2.0 |

Reasoning (GPQA Diamond)

This benchmark is the most discriminating: 198 doctoral-level questions, impossible to solve by simple retrieval.

| Model | GPQA Diamond | Active params |
|---|---|---|
| Gemini 3.1 Pro (closed) | 94.3 | -- |
| GPT-5.4 (closed) | 92.8 | -- |
| Claude Opus 4.6 (closed) | 91.3 | -- |
| Gemma 4 31B | 84.3 | 31B |
| Qwen3.5-9B | 81.7 | 9B |
| GPT-OSS-120B | 80.9 | 5.1B |
| GLM-4.5-Air | 75.0 | 12B |
| Mistral Small 4 | 71.2 | 6B |
| Llama 3.3 70B | 50.5 | 70B |

Qwen3.5-9B at 81.7 with only 9 billion parameters. That's the most surprising number in this review.


Code

| Model | SWE-bench | Codeforces | Active | License |
|---|---|---|---|---|
| Claude Opus 4.6 (closed) | 80.8% | -- | -- | -- |
| Gemini 3.1 Pro (closed) | 80.6% | -- | -- | -- |
| GPT-5.4 (closed) | ~80% | -- | -- | -- |
| Step-3.5-Flash | 74.4% | -- | 11B | Apache 2.0 |
| Devstral Small 2 | 68.0% | -- | 24B | Apache 2.0 |
| GPT-OSS-120B | 62.4% | 2622 | 5.1B | Apache 2.0 |
| Gemma 4 31B | -- | 2150 | 31B | Apache 2.0 |

SWE-bench measures the ability to fix real bugs in existing codebases. Codeforces measures pure algorithmic skill. These are not the same: GPT-OSS-120B dominates in competition (ELO 2622) but gets beaten on real bugs by Step-3.5-Flash (74.4% vs 62.4%).


Specialized reasoning

| Model | Specialty | Key score | Active | License |
|---|---|---|---|---|
| QwQ-32B | Reasoning RL | AIME ~80% | 32B | Apache 2.0 |
| DeepSeek R1-Distill-32B | Distilled reasoning | beats o1-mini | 32B | MIT |
| Nemotron Nano 9B v2 | Math + /think control | MATH-500 97.8% | 9B | Nemotron OML |

Nemotron Nano 9B v2 has an interesting feature: /think and /no_think modes let you control the reasoning budget per request. An agent can think hard on a math problem and respond instantly to a simple question. This is a production feature, not a gimmick.
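A sketch of what per-request budget control looks like from the caller's side. The `/think` and `/no_think` markers come from the model card, but how they are spliced into the prompt here is an assumption: the real Nemotron chat template differs, and the `<system>`/`<user>` tags below are placeholders.

```rust
// Hypothetical sketch: toggling a Nemotron-style reasoning budget per request.
// The "/think" and "/no_think" markers are documented; the surrounding
// template syntax here is invented for illustration only.
#[derive(Clone, Copy)]
enum Reasoning {
    Think,   // spend the full chain-of-thought budget
    NoThink, // answer immediately, no visible reasoning
}

fn build_prompt(system: &str, user: &str, mode: Reasoning) -> String {
    let marker = match mode {
        Reasoning::Think => "/think",
        Reasoning::NoThink => "/no_think",
    };
    format!("<system>{system} {marker}</system>\n<user>{user}</user>")
}

fn main() {
    let hard = build_prompt("You are a math assistant.", "Prove the lemma.", Reasoning::Think);
    let easy = build_prompt("You are a math assistant.", "2+2?", Reasoning::NoThink);
    assert!(hard.contains("/think"));
    assert!(easy.contains("/no_think"));
}
```

An agent framework would pick the mode per request, e.g. routing anything classified as "math" or "multi-step" to `Reasoning::Think` and everything else to `Reasoning::NoThink`.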


Compact models (< 8 GB)

For edge, mobile, or laptops with limited RAM.

| Model | Active | VRAM Q4 | Strength | License |
|---|---|---|---|---|
| SmolLM3-3B | 3B | ~2 GB | Best 3B, AIME 36.7%, /think mode, 64K ctx | Apache 2.0 |
| SmolLM2-1.7B | 1.7B | ~1 GB | 11T tokens, data-centric | Apache 2.0 |
| SmolLM2-135M | 135M | < 1 GB | Ultra-compact, few MB quantized | Apache 2.0 |
| Gemma 4 E2B | 2.3B | ~4 GB | Multimodal + audio | Apache 2.0 |
| Gemma 4 E4B | 4.5B | ~6 GB | Multimodal + audio | Apache 2.0 |
| Phi-4 | 3.8B-14B | 2-8 GB | Math, trimodal (5.6B) | MIT |
| Ministral 3B/8B/14B | 3-14B | 2-8 GB | Vision + reasoning | Apache 2.0 |
| LFM2.5-1.2B | 1.2B | ~1 GB | IFBench 47.3 (2x Qwen3-1.7B), thinking mode, vision, audio | LFM Open v1.0 |
| Llama 3.2 1B/3B | 1-3B | < 2 GB | 128K ctx, edge/mobile, EU OK (text-only) | Llama Community |
| InternLM3-8B | 8B | ~5 GB | Thinking mode, 4T tokens (75% less than competition) | Apache 2.0 |
| InternVL3-1B→38B | 1-38B | 1-20 GB | Vision SOTA, full range edge→server | Apache 2.0 |

HuggingFace's SmolLM3-3B beats all other 3B models and competes with 4B ones. SmolLM2's data-centric approach shows that data quality and volume matter more than parameter count: the 1.7B trained on 11T curated tokens beats larger models trained on less data.

Ministral 14B at 85% on AIME 2025 for a 14B dense model is remarkable. And it fits in 8 GB quantized.
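The "VRAM Q4" column can be roughly reproduced with a back-of-envelope formula. The constants below are my assumptions, not published figures: roughly 4.5 bits per weight for Q4_K-style formats (the extra half bit covers scales and block metadata), plus a fixed ~1 GB of overhead for activations and a modest KV cache.

```rust
// Back-of-envelope VRAM estimate for a Q4-quantized model.
// Assumptions: ~4.5 bits/weight (Q4_K-style, scales included),
// plus ~1 GB of fixed overhead for activations and a small KV cache.
fn q4_vram_gb(params_billions: f64) -> f64 {
    let bits_per_weight = 4.5;
    let weights_gb = params_billions * bits_per_weight / 8.0;
    weights_gb + 1.0
}

fn main() {
    // Ministral 14B: ~8.9 GB with these constants. The "fits in 8 GB" claim
    // holds with tighter quants (Q4_0, smaller KV cache) than assumed here.
    let m14 = q4_vram_gb(14.0);
    assert!(m14 > 8.0 && m14 < 10.0);

    // SmolLM3-3B: ~2.7 GB, consistent with the ~2 GB order of magnitude.
    let s3 = q4_vram_gb(3.0);
    assert!(s3 < 3.0);
}
```

For MoE models the same formula applies to total parameters, not active ones: all experts must reside in memory even if only a few fire per token.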


Long context and alternative architectures

| Model | Max ctx | RULER 1M | Architecture | Active | License |
|---|---|---|---|---|---|
| Nemotron 3 Nano | 1M | 86.3% | Mamba/MoE | 3.5B | Nemotron OML |
| Granite 4.0 | -- | -- | 90% Mamba-2 / 10% Attention | 3-9B | Apache 2.0 |
| LFM2/2.5 | 32K | -- | Convolutions + grouped attention | 2.3B | LFM Open v1.0 |

Nemotron 3 Nano is the long context champion: 86.3% on RULER at 1 million tokens, with only 3.5B active parameters. Mamba's linear complexity gives it a structural advantage over pure Transformers here.

But beware: many models advertise "1M context" without publishing a RULER score at that length. Without measurement, it's marketing.
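The structural advantage is easy to quantify. A Transformer's KV cache grows linearly with context length, while a Mamba-style recurrent state is constant-size. A rough memory computation, using an illustrative 32-layer configuration (not Nemotron's real architecture):

```rust
// Why linear-state models scale to 1M tokens: KV-cache memory growth.
// The layer/head dimensions below are illustrative, not a real config.
fn kv_cache_gb(ctx_tokens: u64, layers: u64, kv_heads: u64, head_dim: u64) -> f64 {
    // 2 tensors (K and V) × 2 bytes (fp16) per element
    let bytes = 2 * 2 * ctx_tokens * layers * kv_heads * head_dim;
    bytes as f64 / 1e9
}

fn main() {
    // A hypothetical 32-layer Transformer, 8 KV heads of dim 128:
    let at_1m = kv_cache_gb(1_000_000, 32, 8, 128);
    // ~131 GB of KV cache at 1M tokens, before counting any weights.
    assert!(at_1m > 100.0);

    // A Mamba-style recurrent state is O(1) in sequence length: the same
    // fixed state serves 1K or 1M tokens.
    let per_token = kv_cache_gb(1, 32, 8, 128);
    assert!(per_token < 0.001);
}
```

Grouped-query attention, KV quantization, and sliding windows all shrink that 131 GB, but the growth stays linear in context; only a recurrent or hybrid architecture makes it constant.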


Observations

What follows is not a list of definitive truths. These are patterns I observed while analyzing these models. They deserve to be verified over time.

Dense retreats above 35B, but doesn't die

For generalists above 35B, MoE (Mixture of Experts) clearly dominates: GPT-OSS-120B, Mistral Small 4, Qwen3.5-122B-A10B, GLM-4.5-Air, Step-3.5-Flash, Nemotron 3 Super... all MoE. The quality/compute ratio has become too favorable. But dense survives where it has a structural advantage: Llama 3.3 70B (generalist, MMLU 86.0), InternVL3-78B (vision, MMMU 72.2), Kimina-Prover-72B (theorem proving), Qwen 2.5-72B (production NLP), DeepSeek R1-Distill-70B (distilled reasoning). Dense is becoming a specialization choice, no longer the default.

Parameter count is no longer the determining factor

Qwen3.5-9B (9B) beats GPT-OSS-120B (5.1B active, 117B total) on GPQA Diamond. Architecture, training method (distillation + multi-agent RL), and data quality matter more than raw size.

Qwen has become the de facto base model

BFS-Prover (base Qwen2.5-32B), Goedel-Prover (base Qwen3-32B), Kimina-Prover (base Qwen2.5-72B), most community distillations: everything builds on Qwen. It's the equivalent of what ResNet was for transfer learning in vision a decade ago.

InternVL3 is the best open-source VLM nobody was talking about

InternVL3-78B (Shanghai AI Lab) reaches 72.2 on MMMU — on par with GPT-4o — under Apache 2.0. With a range from 1B to 78B, it's the direct competitor to Gemma 4 for multimodal. And InternLM3-8B proves you can reach SOTA with 75% fewer training tokens (4T instead of 15-18T). The lab gets less press than Alibaba, but the results speak.

The 40-79B segment is the dense survivors' refuge

New models often jump from ~35B straight to ~120B total via MoE. But the 40-79B range is still well populated by quality dense models: Llama 3.3 70B (Dec 2024), InternVL3-78B (Apr 2025), Kimina-Prover-72B (Apr 2025), Qwen 2.5-72B, R1-Distill-70B, Jamba 1.6 Mini 52B. This is where dense resists, and where you find both solid generalists and specialists (vision, theorem proving, math).

Step-3.5-Flash is the Swiss Army knife

It appears in 4 categories (code, generalist, agents, speed): SWE-bench 74.4%, 350 tok/s, and some of the best agent scores. If you could only deploy one model on a multi-GPU server, it's probably the most versatile.

GPT-OSS-120B has the best active-params-to-performance ratio

5.1B active parameters for a Codeforces ELO of 2622 and 96.6% on AIME. The most efficient model in the landscape for coding and math.
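The claim can be made concrete by normalizing GPQA Diamond scores by active parameters. This ratio is just illustrative arithmetic on the table above, not an established metric:

```rust
// GPQA Diamond points per billion active parameters, from the table above.
// An informal efficiency ratio, not a standard benchmark metric.
fn score_per_active_b(score: f64, active_b: f64) -> f64 {
    score / active_b
}

fn main() {
    let gpt_oss = score_per_active_b(80.9, 5.1);  // ≈ 15.9
    let qwen9b = score_per_active_b(81.7, 9.0);   // ≈ 9.1
    let llama70 = score_per_active_b(50.5, 70.0); // ≈ 0.7
    assert!(gpt_oss > qwen9b && qwen9b > llama70);
}
```

The caveat: active parameters measure compute per token, not memory. GPT-OSS-120B still needs its full 117B weights resident, so the ratio flatters MoE on throughput while hiding its VRAM footprint.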


Licenses: don't overlook this

Most models listed here are under Apache 2.0: free commercial use, no geographic restriction, patent grant included, irrevocable license. Same license as TensorFlow or Kubernetes.

Notable exceptions:

| License | Models | Status |
|---|---|---|
| Apache 2.0 | Gemma 4, Qwen 3/3.5, GPT-OSS, Ministral, Step-3.5-Flash | Exploitable everywhere |
| MIT | GLM-4.5-Air, DeepSeek R1-Distill, Phi-4 | Exploitable everywhere |
| Nemotron OML | Nemotron 3 Nano/Super | Exploitable (custom, royalty-free, not OSI) |
| Llama Community | Llama 3.3 70B, Llama 3.2 1B/3B (text-only) | EU OK (700M MAU threshold) |
| LFM Open v1.0 | LFM2, LFM2.5 | Exploitable < $10M revenue |

Gemma 4 under Apache 2.0 is a turning point. Google previously used a restrictive custom license (Gemma Terms of Use). The switch to Apache 2.0 aligns Gemma with the rest of the open-source ecosystem.

Rejected models


How to choose

| Constraint | Recommendation |
|---|---|
| Smartphone / edge (< 4 GB) | Gemma 4 E2B, Phi-4-mini, Ministral 3B, LFM2.5-1.2B, Llama 3.2 1B/3B |
| Laptop 16 GB | GPT-OSS-20B, Ministral 14B, Gemma 4 26B-A4B |
| Desktop 24 GB | Gemma 4 31B, DeepSeek R1-Distill-32B, Devstral Small 2 |
| Desktop 48+ GB (dense 70B) | Llama 3.3 70B (MMLU 86.0, HumanEval 88.4, EU OK) |
| Server single-GPU | GPT-OSS-120B |
| Server multi-GPU | Step-3.5-Flash, Nemotron 3 Super, Qwen3.5-122B |
| Long context (> 256K) | Nemotron 3 Nano |
| Math | Nemotron Nano 9B v2 (with /think), GPT-OSS-120B |
| Code (real bugs) | Step-3.5-Flash, Devstral Small 2 |
| Multilingual (> 100 languages) | Qwen 3.5 (201 languages), Qwen 3 (119 languages) |
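The hardware rows of that table collapse into a simple decision function keyed on available VRAM. A sketch: the thresholds and picks mirror the table, and it deliberately ignores the task-specific rows (math, code, long context), which would take priority over raw memory budget in practice.

```rust
// The "how to choose" hardware rows as a decision function on VRAM budget.
// Thresholds and recommendations come from the table above; task-specific
// constraints (math, code, long context) are out of scope for this sketch.
fn recommend(vram_gb: u32) -> &'static str {
    match vram_gb {
        0..=3 => "Gemma 4 E2B / Ministral 3B / LFM2.5-1.2B",
        4..=16 => "GPT-OSS-20B / Ministral 14B",
        17..=24 => "Gemma 4 31B / Devstral Small 2",
        25..=48 => "Llama 3.3 70B",
        _ => "GPT-OSS-120B (single GPU) or Step-3.5-Flash (multi-GPU)",
    }
}

fn main() {
    assert_eq!(recommend(16), "GPT-OSS-20B / Ministral 14B");
    assert_eq!(recommend(48), "Llama 3.3 70B");
    assert_eq!(recommend(80), "GPT-OSS-120B (single GPU) or Step-3.5-Flash (multi-GPU)");
}
```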

What's next

This overview covers text LLMs. More articles will follow on specialized models: embedding and retrieval, speech recognition, text-to-speech, image generation, theorem provers (Lean 4), and GUI agents.

The data in this article comes from a systematic review of over 60 models, with benchmark and license verification against primary sources (papers, HuggingFace, official repositories). The public reference for all 71 benchmarks (with paper, dataset and leaderboard links) is available at github.com/xigh/open-weight-models.


If you spot an error or a missing model, get in touch.


Questions about this article or your own project? Book a consultation