4 June 2026

Small coding models on Terminal-Bench 2

Updated on: June 4th 2026
Original date: Feb 26th 2026

Frontier models get most of the headlines, but the more interesting race is happening one tier down. Here’s how open-weight and smaller models stack up on Terminal-Bench 2.0.

Benchmark Comparison

[Chart: Small Coding Models on Terminal-Bench 2.0]

Source: Terminal-Bench 2.0 leaderboard. In MoE model names, the A-suffix gives the activated parameter count (for example, Qwen3.5-397B-A17B activates 17B of its 397B parameters). K2.5-1T-A32B is a 1T-parameter sparse MoE from Moonshot AI with 32B active parameters.

The top of the chart is no longer just a Qwen3.5 story. Qwen3.6-27B now leads this group at 59.3%, outperforming even the much larger Qwen3.5-397B-A17B at 52.5%. Qwen3.6-35B-A3B also lands in the top tier at 51.5%. Against their same-sized Qwen3.5 predecessors, that is a gain of 17.7 points for the 27B model and 11.0 points for the 35B-A3B model, suggesting Alibaba’s newer generation is pushing small-model coding performance meaningfully higher.

Just below that, the older leaders still hold up well. K2.5-1T-A32B scores 50.8%, Qwen3.5-122B-A10B reaches 49.4%, and Gemma4-31B comes in at 42.9%. In the same general size class, Qwen3.5-27B posts 41.6%, Qwen3.5-35B-A3B scores 40.5%, and Gemma4-26BA4B trails at 34.2%.

GPT-OSS-120B at 18.7% remains the clearest outlier. At 62.8 GB it is roughly three to four times the size of the 27B–35B class (16.7 GB to 22.1 GB), yet it trails every model in that class, including Gemma4-26BA4B at 34.2%. Once you add model size as a second dimension, the efficiency story becomes much more interesting than the raw leaderboard alone.

Model	Score	Provider	GGUF	Q4_K_M size
Qwen3.6-27B	59.3%	Alibaba	`unsloth/Qwen3.6-27B-GGUF`	16.8 GB
Qwen3.5-397B-A17B	52.5%	Alibaba	`unsloth/Qwen3.5-397B-A17B-GGUF`	244 GB
Qwen3.6-35B-A3B	51.5%	Alibaba	`unsloth/Qwen3.6-35B-A3B-GGUF`	22.1 GB
K2.5-1T-A32B	50.8%	Moonshot	`unsloth/Kimi-K2.5-GGUF`	621 GB
Qwen3.5-122B-A10B	49.4%	Alibaba	`unsloth/Qwen3.5-122B-A10B-GGUF`	76.5 GB
Gemma4-31B	42.9%	Google	`unsloth/gemma-4-31B-it-GGUF`	18.3 GB
Qwen3.5-27B	41.6%	Alibaba	`unsloth/Qwen3.5-27B-GGUF`	16.7 GB
Qwen3.5-35B-A3B	40.5%	Alibaba	`unsloth/Qwen3.5-35B-A3B-GGUF`	22 GB
Gemma4-26BA4B	34.2%	Google	`unsloth/gemma-4-26B-A4B-it-GGUF`	16.9 GB
GPT-OSS-120B	18.7%	OpenAI	`unsloth/gpt-oss-120b-GGUF`	62.8 GB
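
To put the score and size columns on one scale, here is a minimal sketch that divides each Terminal-Bench 2.0 score by the listed download size. The numbers are copied from the table above; the `MODELS` list and the `score_per_gb` helper are names made up for this illustration, and "points per GB" is only a rough efficiency proxy, not a metric the leaderboard reports.

```python
# (model, Terminal-Bench 2.0 score in %, GGUF size in GB), copied from the table above.
MODELS = [
    ("Qwen3.6-27B", 59.3, 16.8),
    ("Qwen3.5-397B-A17B", 52.5, 244.0),
    ("Qwen3.6-35B-A3B", 51.5, 22.1),
    ("K2.5-1T-A32B", 50.8, 621.0),
    ("Qwen3.5-122B-A10B", 49.4, 76.5),
    ("Gemma4-31B", 42.9, 18.3),
    ("Qwen3.5-27B", 41.6, 16.7),
    ("Qwen3.5-35B-A3B", 40.5, 22.0),
    ("Gemma4-26BA4B", 34.2, 16.9),
    ("GPT-OSS-120B", 18.7, 62.8),
]


def score_per_gb(models):
    """Return (model, score points per GB) pairs, most efficient first."""
    ranked = [(name, score / size_gb) for name, score, size_gb in models]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for name, ratio in score_per_gb(MODELS):
        print(f"{name:<20} {ratio:5.2f} points/GB")
```

On these numbers, Qwen3.6-27B comes out first at about 3.5 points per GB and K2.5-1T-A32B last at under 0.1, which is the gap the raw leaderboard hides.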

Intelligence vs Size

[Chart: Terminal-Bench Score vs Model Size; size axis: Q4_K_M GGUF size in GB]

Note: Sizes are GGUF download sizes for the same quantization level, Q4_K_M. This chart compares storage footprint against Terminal-Bench 2.0 score, making the efficiency tradeoff more visible than the leaderboard alone.

Using the same Q4_K_M quantization across the board makes the size comparison much cleaner. The scatter plot shows the main takeaway immediately: Qwen3.6-27B defines the frontier here, delivering the highest score while staying under 17 GB. Gemma4-31B and Qwen3.5-27B also look strong on a score-per-GB basis, while the giant MoE checkpoints buy only a few extra points over those two at a very steep storage cost, and still land below Qwen3.6-27B.
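
The "frontier" in that scatter plot can be made concrete. The sketch below computes a size-versus-score Pareto frontier: a model stays on it only if no smaller-or-equal model scores at least as high. The data is copied from the table earlier in the post; `MODELS` and `pareto_frontier` are hypothetical names for this example, and the post itself does not say how (or whether) a frontier was computed for the chart.

```python
# (model, Terminal-Bench 2.0 score in %, Q4_K_M GGUF size in GB), copied from the table.
MODELS = [
    ("Qwen3.6-27B", 59.3, 16.8),
    ("Qwen3.5-397B-A17B", 52.5, 244.0),
    ("Qwen3.6-35B-A3B", 51.5, 22.1),
    ("K2.5-1T-A32B", 50.8, 621.0),
    ("Qwen3.5-122B-A10B", 49.4, 76.5),
    ("Gemma4-31B", 42.9, 18.3),
    ("Qwen3.5-27B", 41.6, 16.7),
    ("Qwen3.5-35B-A3B", 40.5, 22.0),
    ("Gemma4-26BA4B", 34.2, 16.9),
    ("GPT-OSS-120B", 18.7, 62.8),
]


def pareto_frontier(models):
    """Return names of models not beaten on both size and score, smallest first."""
    frontier = []
    best_score = float("-inf")
    # Walk from smallest to largest; on equal size, visit the higher score first.
    for name, score, size_gb in sorted(models, key=lambda m: (m[2], -m[1])):
        # Keep a model only if it beats every smaller model seen so far.
        if score > best_score:
            frontier.append(name)
            best_score = score
    return frontier


if __name__ == "__main__":
    print(pareto_frontier(MODELS))
```

With the table's numbers, only Qwen3.5-27B (the smallest download) and Qwen3.6-27B survive: every other model in the set is both larger and lower-scoring than Qwen3.6-27B.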

The ceiling for this set is now close to 60%, which is a meaningful jump from the earlier ~52% range. Small and mid-sized open models are improving fast, and the newest Qwen3.6 entries make that trend hard to ignore.

One size class down, on the text benchmarks that both models report, Qwen3.5-9B looks stronger on MMLU-Pro and GPQA Diamond, while Gemma 4 12B Unified is ahead on LiveCodeBench v6 and MMMLU. The Tau2 row is less clean because the labels differ: Gemma reports “Tau2 (average over 3)”, while Qwen reports “TAU2-Bench”, so the two numbers may not come from the same setup.

Metric	Gemma 4 12B	Qwen3.5-9B
MMLU-Pro	77.2%	82.5%
GPQA Diamond	78.8%	81.7%
LiveCodeBench v6	72.0%	65.6%
MMMLU	83.4%	81.2%
Tau2 / TAU2-Bench	69.0%	79.1%
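
The per-benchmark gaps are easier to read as differences in percentage points. This is a small sketch using the numbers from the table above; `RESULTS` and `qwen_minus_gemma` are illustrative names, and the Tau2 difference should be read with the labeling caveat in mind.

```python
# (benchmark, Gemma 4 12B score in %, Qwen3.5-9B score in %), copied from the table above.
RESULTS = [
    ("MMLU-Pro", 77.2, 82.5),
    ("GPQA Diamond", 78.8, 81.7),
    ("LiveCodeBench v6", 72.0, 65.6),
    ("MMMLU", 83.4, 81.2),
    ("Tau2 / TAU2-Bench", 69.0, 79.1),  # labels differ between the two reports
]


def qwen_minus_gemma(results):
    """Return {benchmark: gap in percentage points}, positive when Qwen3.5-9B leads."""
    return {name: round(qwen - gemma, 1) for name, gemma, qwen in results}


if __name__ == "__main__":
    for name, delta in qwen_minus_gemma(RESULTS).items():
        leader = "Qwen3.5-9B" if delta > 0 else "Gemma 4 12B"
        print(f"{name:<20} {delta:+5.1f} points ({leader} ahead)")
```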