The Contest
Qwen 3: the shiny new “mixture-of-experts” giant developed by Alibaba.
Llama 4 Maverick: Meta’s latest pride, tuned to punch above its weight.
Qwen 2.5 Coder: an “old horse”, fine-tuned purely for software jobs.
The Arena
We ran Aider’s polyglot benchmark: 200-plus programming puzzles. Each model gets two attempts at every puzzle, spread across over a dozen languages: Python, JavaScript, C, C++, Rust, Go, Java and friends. The tasks cover algorithms, data structures, file I/O, string wrangling, even a dash of concurrency. Because the framework is open source, nobody can claim a bent referee.
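For readers who want to see how a two-attempt score like this can be tallied, here is a minimal sketch. It is not Aider's own code: the `Result` record, the `pass_rates` helper and the sample puzzle names are all hypothetical, and it assumes the second-try figure is cumulative (a puzzle counts if it passed on either attempt).

```python
from dataclasses import dataclass


@dataclass
class Result:
    """Outcome of one puzzle for one model (hypothetical record shape)."""
    problem: str
    passed_first: bool   # solved on the first attempt
    passed_second: bool  # solved on the retry after a first-attempt failure


def pass_rates(results):
    """Return (first_try_pct, second_try_pct), each rounded to one decimal.

    Assumption: the second-try figure is cumulative, so a puzzle counts
    if it passed on either attempt.
    """
    total = len(results)
    if total == 0:
        raise ValueError("no results to score")
    first = sum(r.passed_first for r in results)
    within_two = sum(r.passed_first or r.passed_second for r in results)
    return round(100 * first / total, 1), round(100 * within_two / total, 1)


# Made-up data, purely for illustration.
sample = [
    Result("puzzle-a", True, False),
    Result("puzzle-b", False, True),
    Result("puzzle-c", False, False),
    Result("puzzle-d", False, False),
]

print(pass_rates(sample))  # (25.0, 50.0)
```

Under that cumulative assumption the second-try number can never be lower than the first-try number, which is why a model that gains nothing from the retry shows the same figure twice.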
The Leaderboard
Qwen 3: 2.7 % first-try, 2.7 % second-try (and an eye-watering 444 error-outs).
Llama 4 Maverick: 2.2 % first-try, 7.6 % second-try.
Qwen 2.5 Coder: 1.8 % first-try, 8.9 % second-try.
Yes, you read those numbers correctly: not eighty-nine but eight point nine. Even after a second swing at the same problem, the medals cabinet looks tragically bare. Qwen 3 is newer and bigger, yet it solved fewer problems than the older, coding-trained Qwen 2.5 Coder. Llama 4 Maverick kept pace with the veteran but still left over ninety per cent of the suite unsolved.
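The arithmetic behind that verdict is easy to check. The sketch below takes the pass rates from the leaderboard above and works out how much of the suite each model left unsolved, and how much the retry helped. The helper names `unsolved_pct` and `retry_gain` are made up for illustration.

```python
# (first-try %, second-try %) as reported in the leaderboard above.
scores = {
    "Qwen 3": (2.7, 2.7),
    "Llama 4 Maverick": (2.2, 7.6),
    "Qwen 2.5 Coder": (1.8, 8.9),
}


def unsolved_pct(second_try_pct):
    """Share of the suite still unsolved after both attempts."""
    return round(100.0 - second_try_pct, 1)


def retry_gain(first_try_pct, second_try_pct):
    """Percentage points gained by allowing a second attempt."""
    return round(second_try_pct - first_try_pct, 1)


for model, (first, second) in scores.items():
    print(f"{model}: {unsolved_pct(second)} % unsolved, "
          f"+{retry_gain(first, second)} points from the retry")

# Qwen 3: 97.3 % unsolved, +0.0 points from the retry
# Llama 4 Maverick: 92.4 % unsolved, +5.4 points from the retry
# Qwen 2.5 Coder: 91.1 % unsolved, +7.1 points from the retry
```

Even the best of the three leaves more than nine puzzles in ten unsolved, and Qwen 3 gains nothing at all from its second swing.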
So, should developers sleep easy?
For tonight, absolutely. A machine that fails more than nine questions in ten will not be leading the code review. Yet it would be folly to laugh too loudly. Two years ago these pass rates were zero. Progress, though patchy, is relentless, and every “error-out” is a researcher’s next coffee-fuelled weekend.
In other words: keep learning, keep refactoring — and maybe keep your CV polished just in case the talking toaster stops burning the toast and starts compiling the kernel.