Gian.cool Gianfranco's blog
-
Small coding models on Terminal-Bench 2
Updated on: June 4th 2026
Original date: Feb 26th 2026
Frontier models get most of the headlines, but the more interesting race is happening one tier down. Here’s how open-weight and smaller models stack up on Terminal-Bench 2.0.
Chart: Small Coding Models on Terminal-Bench 2.0 (benchmark comparison)
Source: Terminal-Bench 2.0 leaderboard. All Qwen3.5 MoE models are labeled with their activated parameter count (the A-suffix). K2.5-1T-A32B is a 1T-parameter sparse MoE from Moonshot AI with 32B active parameters.
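Each MoE label packs two numbers: total parameters and activated parameters (the A-suffix). As a rough sketch, assuming labels follow a `<total>T|B-A<active>B` pattern like K2.5-1T-A32B, the hypothetical helper `parse_moe_name` below pulls both numbers out and computes what fraction of the weights is active per token. It treats 1T as 1,000B, the usual shorthand.

```python
import re


def parse_moe_name(name: str) -> tuple[float, float]:
    """Return (total_params_B, active_params_B) for a label like 'K2.5-1T-A32B'.

    Hypothetical helper: assumes the label ends in '-<total>T|B-A<active>B'.
    """
    match = re.search(r"-(\d+(?:\.\d+)?)([TB])-A(\d+(?:\.\d+)?)B$", name)
    if match is None:
        raise ValueError(f"no total/active suffix in {name!r}")
    # Convert a T (trillion) total into billions so both numbers share a unit.
    total = float(match.group(1)) * (1000 if match.group(2) == "T" else 1)
    active = float(match.group(3))
    return total, active


total, active = parse_moe_name("K2.5-1T-A32B")
print(f"{active:.0f}B of {total:.0f}B active ({active / total:.1%})")
# 32B of 1000B active (3.2%)
```

Only about 3% of K2.5’s weights are used for any given token, which is why sparse models this large can still compete in the “smaller model” tier on inference cost.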
-
AgentCaller: a phone layer for AI agents
AI agents are getting pretty good at the internet.
They can search, compare options, fill out forms, click through workflows, send emails, and call APIs. But much of the real world still sits beyond their reach, behind a channel most agents cannot use: the phone network.
That is the idea behind AgentCaller.io, a product I am testing now.
The pitch is simple: let a user’s AI agent call businesses, handle the conversation, and return a structured result the agent can act on.
Not a human call center. Not a browser automation hack. A phone interface built for agents.
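The post does not spell out what a “structured result” looks like. As a purely illustrative sketch, here is one shape an agent could act on after a call. The `CallResult` class and every field name are hypothetical and are not AgentCaller’s actual API.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class CallResult:
    """Hypothetical record returned to an agent after a phone call (illustrative only)."""

    business: str
    phone_number: str
    outcome: str  # assumed values, e.g. "answered", "voicemail", "no_answer"
    summary: str
    extracted: dict[str, str] = field(default_factory=dict)  # facts pulled from the conversation

    def succeeded(self) -> bool:
        # In this sketch, only a live conversation counts as success.
        return self.outcome == "answered"


result = CallResult(
    business="Example Dental",
    phone_number="+1-555-0100",
    outcome="answered",
    summary="Next cleaning slot is Tuesday at 10am.",
    extracted={"appointment_day": "Tuesday", "appointment_time": "10:00"},
)
if result.succeeded():
    print(json.dumps(asdict(result), indent=2))
```

The point of a fixed shape like this is that the calling agent can branch on `outcome` and read `extracted` directly instead of parsing a free-form transcript.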
-
Opus 4.6 vs GPT Codex 5.3 vs GPT 5.4
Updated to include GPT-5.4 and Gemini 3.1 Pro
A comparison of benchmark metrics between Opus 4.6 and Codex 5.3 models.
Anthropic and OpenAI both recently published Terminal-Bench 2.0 results, but split across separate charts and a table. I wanted the full picture, so I combined them.
Chart: Agentic Coding on Terminal-Bench 2.0 (benchmark comparison)
Note: All OpenAI models are shown at the xhigh compute setting. GPT-5.2-Codex appears twice: 64.7% as reported by Anthropic and 64.0% as reported by OpenAI. The harnesses differ: Anthropic and Google used the Terminus-2 harness, while OpenAI used Codex, so scores are not directly comparable across providers.
-
I Built a Desktop Audio Converter With Claude Code
I’ve been meaning to build this app, and I actually started on it around this time last year. But after learning how to “code” (or rather, build) with AI coding agents like Claude Code, I just gave it this prompt:
“plan how to complete this app. it should allow one or multiple files to be selected or dragged (audio only) and then it should show a box to select which format to convert to e.g mp3, wav, aac,ogg, flac,m4a, mp4) plus certain options that come from ffmpeg to compress the file”
It wrote a comprehensive plan, and then it went for it. I asked one small follow-up question to fix a UI color issue. And voilà!
A(I) built Audioslim, a native macOS app that converts audio between MP3, WAV, AAC, OGG, FLAC, M4A, and MP4. I built it with Claude Code, Anthropic’s AI coding assistant for the terminal.
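The prompt asks for a format picker plus compression options that come from ffmpeg. Here is a minimal sketch of that idea, assuming the app builds one ffmpeg command line per file. The encoder mapping, the `build_ffmpeg_args` helper, and the `-converted` output naming are illustrative assumptions, not Audioslim’s code. The encoder names themselves (`libmp3lame`, `libvorbis`, `pcm_s16le`, `aac`, `flac`) are standard ffmpeg encoders, though `libmp3lame` and `libvorbis` depend on how ffmpeg was built.

```python
import shlex
from pathlib import Path
from typing import Optional

# Assumed format -> ffmpeg audio encoder mapping (illustrative, not Audioslim's).
ENCODERS = {
    "mp3": "libmp3lame",
    "wav": "pcm_s16le",
    "aac": "aac",
    "ogg": "libvorbis",
    "flac": "flac",
    "m4a": "aac",
    "mp4": "aac",
}
# Bitrate is not a meaningful setting for lossless targets, so -b:a is skipped for them.
LOSSLESS = {"wav", "flac"}


def build_ffmpeg_args(src: str, fmt: str, bitrate_kbps: Optional[int] = 192) -> list[str]:
    """Build an ffmpeg argv that converts one audio file to `fmt`."""
    fmt = fmt.lower()
    if fmt not in ENCODERS:
        raise ValueError(f"unsupported format: {fmt}")
    src_path = Path(src)
    # Write next to the source with a suffix so a same-format re-encode never overwrites the input.
    dst = src_path.with_name(f"{src_path.stem}-converted.{fmt}")
    args = ["ffmpeg", "-i", src, "-vn", "-c:a", ENCODERS[fmt]]  # -vn drops any video/cover-art stream
    if fmt not in LOSSLESS and bitrate_kbps is not None:
        args += ["-b:a", f"{bitrate_kbps}k"]  # target audio bitrate: the main size knob for lossy codecs
    args.append(str(dst))
    return args


print(shlex.join(build_ffmpeg_args("song.flac", "mp3", 128)))
# ffmpeg -i song.flac -vn -c:a libmp3lame -b:a 128k song-converted.mp3
```

Returning an argument list rather than a shell string means the app can pass it straight to a process launcher without worrying about spaces or quotes in file names; `shlex.join` is only used here to display it.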
