
Small coding models on Terminal-Bench 2

Updated on: June 4th 2026
Original date: Feb 26th 2026

Frontier models get most of the headlines, but the more interesting race is happening one tier down. Here’s how open-weight and smaller models stack up on Terminal-Bench 2.0.

Benchmark Comparison

[Chart: Small Coding Models on Terminal-Bench 2.0]

Source: Terminal-Bench 2.0 leaderboard. The A-suffix in the Qwen MoE model names gives the activated parameter count (for example, A17B means 17B active parameters). K2.5-1T-A32B is a 1T-parameter sparse MoE from Moonshot AI with 32B active parameters.

The top of the chart is no longer just a Qwen3.5 story. Qwen3.6-27B now leads this group at 59.3%, outperforming even the much larger Qwen3.5-397B-A17B at 52.5%. Qwen3.6-35B-A3B also lands in the top tier at 51.5%, suggesting Alibaba’s newer generation is pushing small-model coding performance meaningfully higher.

Just below that, the older leaders still hold up well. K2.5-1T-A32B scores 50.8%, Qwen3.5-122B-A10B reaches 49.4%, and Gemma4-31B comes in at 42.9%. In the same general size class, Qwen3.5-27B posts 41.6%, Qwen3.5-35B-A3B scores 40.5%, and Gemma4-26BA4B trails at 34.2%.

GPT-OSS-120B at 18.7% remains the clearest outlier. At 62.8 GB it is roughly three to four times the size of the 27B–35B class on disk, yet it posts the lowest score in the table. Once you add model size as a second dimension, the efficiency story becomes much more interesting than the raw leaderboard alone.

| Model | Score | Provider | GGUF | Size |
|-------|-------|----------|------|------|
| Qwen3.6-27B | 59.3% | Alibaba | unsloth/Qwen3.6-27B-GGUF | 16.8 GB |
| Qwen3.5-397B-A17B | 52.5% | Alibaba | unsloth/Qwen3.5-397B-A17B-GGUF | 244 GB |
| Qwen3.6-35B-A3B | 51.5% | Alibaba | unsloth/Qwen3.6-35B-A3B-GGUF | 22.1 GB |
| K2.5-1T-A32B | 50.8% | Moonshot | unsloth/Kimi-K2.5-GGUF | 621 GB |
| Qwen3.5-122B-A10B | 49.4% | Alibaba | unsloth/Qwen3.5-122B-A10B-GGUF | 76.5 GB |
| Gemma4-31B | 42.9% | Google | unsloth/gemma-4-31B-it-GGUF | 18.3 GB |
| Qwen3.5-27B | 41.6% | Alibaba | unsloth/Qwen3.5-27B-GGUF | 16.7 GB |
| Qwen3.5-35B-A3B | 40.5% | Alibaba | unsloth/Qwen3.5-35B-A3B-GGUF | 22 GB |
| Gemma4-26BA4B | 34.2% | Google | unsloth/gemma-4-26B-A4B-it-GGUF | 16.9 GB |
| GPT-OSS-120B | 18.7% | OpenAI | unsloth/gpt-oss-120b-GGUF | 62.8 GB |

Intelligence vs Size

[Chart: Terminal-Bench Score vs Model Size. X-axis: Q4_K_M GGUF size in GB]

Note: Sizes are GGUF download sizes for the same quantization level, Q4_K_M. This chart compares storage footprint against Terminal-Bench 2.0 score, making the efficiency tradeoff more visible than the leaderboard alone.

Using the same Q4_K_M quantization across the board makes the size comparison much cleaner. The scatter plot shows the main takeaway immediately: Qwen3.6-27B sits at the top-left corner of the chart, delivering the highest score while staying under 17 GB. Gemma4-31B and Qwen3.5-27B also look strong on a score-per-GB basis, while the giant MoE checkpoints pay a very steep storage cost for scores that still land below Qwen3.6-27B.
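
To make the score-per-GB comparison concrete, here is a minimal Python sketch that ranks the models by points per GB and picks out the ones no other model beats on both score and size. The numbers are copied from the table above. The names `MODELS`, `score_per_gb`, and `pareto_frontier` are made up for this example, and "points per GB" is a crude ratio used only for illustration, not a metric the leaderboard reports.

```python
# Scores (%) and Q4_K_M GGUF sizes (GB), copied from the table above.
# The dict layout and function names are assumptions made for this sketch.
MODELS = {
    "Qwen3.6-27B": (59.3, 16.8),
    "Qwen3.5-397B-A17B": (52.5, 244.0),
    "Qwen3.6-35B-A3B": (51.5, 22.1),
    "K2.5-1T-A32B": (50.8, 621.0),
    "Qwen3.5-122B-A10B": (49.4, 76.5),
    "Gemma4-31B": (42.9, 18.3),
    "Qwen3.5-27B": (41.6, 16.7),
    "Qwen3.5-35B-A3B": (40.5, 22.0),
    "Gemma4-26BA4B": (34.2, 16.9),
    "GPT-OSS-120B": (18.7, 62.8),
}


def score_per_gb(models):
    """Return (name, points per GB) pairs, best ratio first."""
    ratios = [(name, score / size) for name, (score, size) in models.items()]
    return sorted(ratios, key=lambda pair: pair[1], reverse=True)


def pareto_frontier(models):
    """Return the models that no other model dominates.

    A model is dominated if some other model has a score at least as high
    and a size at least as small, and is strictly better on one of the two.
    """
    frontier = []
    for name, (score, size) in models.items():
        dominated = any(
            other_score >= score
            and other_size <= size
            and (other_score > score or other_size < size)
            for other, (other_score, other_size) in models.items()
            if other != name
        )
        if not dominated:
            frontier.append(name)
    return frontier


if __name__ == "__main__":
    for name, ratio in score_per_gb(MODELS):
        print(f"{name:<20} {ratio:5.2f} points/GB")
    print("Pareto frontier:", pareto_frontier(MODELS))
```

On these numbers the ranking starts with Qwen3.6-27B, Qwen3.5-27B, and Gemma4-31B, and only two models survive the dominance check: Qwen3.6-27B, and Qwen3.5-27B, which stays in only because it is 0.1 GB smaller.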

The ceiling for this set is now close to 60%, which is a meaningful jump from the earlier ~52% range. Small and mid-sized open models are improving fast, and the newest Qwen3.6 entries make that trend hard to ignore.

For the overlapping text benchmarks, Qwen3.5-9B looks stronger on MMLU-Pro and GPQA Diamond, while Gemma 4 12B Unified is ahead on LiveCodeBench v6 and MMMLU. The Tau2 row is less clean because the labels differ: Gemma reports “Tau2 (average over 3)”, while Qwen reports “TAU2-Bench”.

| Metric | Gemma 4 12B | Qwen3.5-9B |
|---|---:|---:|
| MMLU-Pro | 77.2% | 82.5% |
| GPQA Diamond | 78.8% | 81.7% |
| LiveCodeBench v6 | 72.0% | 65.6% |
| MMMLU | 83.4% | 81.2% |
| Tau2 / TAU2-Bench | 69.0% | 79.1% |
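
The gaps in the table are easier to read as point differences. The following is a small sketch, with the scores copied from the table, that computes Qwen3.5-9B minus Gemma 4 12B for each row and flags the Tau2 row as not directly comparable because of the label mismatch noted above. The names `RESULTS`, `MISMATCHED_LABELS`, and `compare` are made up for this example.

```python
# Scores (%) copied from the table above, as (Gemma 4 12B, Qwen3.5-9B).
# All names here are assumptions made for this sketch.
RESULTS = {
    "MMLU-Pro": (77.2, 82.5),
    "GPQA Diamond": (78.8, 81.7),
    "LiveCodeBench v6": (72.0, 65.6),
    "MMMLU": (83.4, 81.2),
    "Tau2 / TAU2-Bench": (69.0, 79.1),
}

# Rows where the two reports use different labels, so the gap is only indicative.
MISMATCHED_LABELS = {"Tau2 / TAU2-Bench"}


def compare(results):
    """Return (metric, delta, leader, comparable) tuples.

    delta is Qwen3.5-9B minus Gemma 4 12B, in percentage points.
    """
    rows = []
    for metric, (gemma, qwen) in results.items():
        delta = round(qwen - gemma, 1)
        if delta > 0:
            leader = "Qwen3.5-9B"
        elif delta < 0:
            leader = "Gemma 4 12B"
        else:
            leader = "tie"
        rows.append((metric, delta, leader, metric not in MISMATCHED_LABELS))
    return rows


if __name__ == "__main__":
    for metric, delta, leader, comparable in compare(RESULTS):
        note = "" if comparable else "  (labels differ)"
        print(f"{metric:<20} {delta:+5.1f} pts  {leader}{note}")
```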