Guide

AI Benchmarks, Explained

Every lab says their model is the best. Benchmarks are the receipts. Here's what each one actually tests, who's winning, and which scores you should pay attention to.

The popularity contest

Automated benchmarks can be gamed. These two measure what real humans prefer when they can't see the label.

LMArena (Chatbot Arena)

Blind human voting

Why it matters: Every other benchmark on this page is designed by researchers with specific tasks in mind. LMArena is the only one that measures what regular people actually prefer when they're not told which model they're using. If you're choosing an AI for everyday work — writing, brainstorming, answering questions — this is the score that predicts whether you'll like it. When the #1 spot changes here, it moves enterprise contracts and API adoption.

Two models answer the same prompt. You pick the better one without knowing which is which. Thousands of these matchups produce an Elo rating, same system as chess. It's messy and democratic and probably the single most important number in AI right now.

A 30-point Elo gap is noticeable. A 5-point gap is noise. The current leader is Claude Opus 4.6 at 1505, edging out Gemini 3 Pro by about 20 points. That lead has changed hands three times in the last six months.
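The rating math behind those matchups can be sketched with the classic Elo update. (LMArena's production pipeline fits a statistical model over all votes at once, so treat this as an intuition-builder; the K-factor and ratings below are illustrative, not the arena's actual parameters.)

```python
def expected_score(rating_a, rating_b):
    """Probability that model A beats model B under the Elo model."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))

def update(rating_a, rating_b, a_won, k=32):
    """Nudge both ratings after one blind matchup: the winner gains
    more when the result was an upset, less when it was expected."""
    exp_a = expected_score(rating_a, rating_b)
    delta = k * ((1 if a_won else 0) - exp_a)
    return rating_a + delta, rating_b - delta

# Why a 30-point gap is "noticeable" and a 5-point gap is noise:
print(round(expected_score(1505, 1475), 3))  # 0.543 -- wins ~54% of votes
print(round(expected_score(1505, 1500), 3))  # 0.507 -- barely above a coin flip
```

A 30-point lead translates to winning only about 54 out of 100 head-to-head votes, which is why rankings a few points apart shouldn't be read as meaningful.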

GDPval-AA

Expert task preference

Why it matters: A model that scores well on LMArena might just be good at sounding confident. GDPval-AA tells you if the output is actually useful to someone who knows what they're looking at. If you're a lawyer, analyst, or engineer relying on AI for real work product, this is the score that predicts whether you'll catch yourself fixing its answers or actually using them.

Same blind-voting idea as LMArena, but the judges are professionals evaluating outputs they'd actually use at work — legal analysis, technical reports, business strategy. A model can charm casual users on LMArena with verbose, confident-sounding answers that would get torn apart by someone who knows the domain. Claude Opus 4.6 leads here too, at 1606.

Can it actually think?

The line between memorization and reasoning. These benchmarks try to find it.

ARC-AGI-2

Novel pattern reasoning

Why it matters: This is the closest thing we have to answering the question everyone's actually asking: can AI think, or is it just a very sophisticated autocomplete? Every improvement here means AI gets better at solving problems it wasn't specifically trained for — debugging novel issues, adapting to unfamiliar situations, handling the stuff that doesn't have a Stack Overflow answer. It's the difference between a tool you can only use for known tasks and one that can genuinely surprise you.

ARC-AGI-2 gives models pattern puzzles they couldn't have seen in their training data. Colored grids with hidden rules — figure out the pattern, complete the sequence. No memorization, no shortcuts. Either you can reason or you can't.

Google's Gemini 3.1 Pro just scored 77.1%, which would have been unthinkable a year ago when the best score was 31%. That jump alone is one of the strongest arguments against the “AI progress is plateauing” crowd.

Score range: 0–100%. Anything above 50% is strong. The gap from 31% to 77% happened in one model generation.

GPQA Diamond

PhD-level science

Why it matters: Drug discovery, materials science, climate modeling — these fields move as fast as researchers can work through complex problems. A model that scores well here could accelerate scientific research by acting as a collaborator that keeps up with PhD-level reasoning. We're approaching the point where AI could meaningfully speed up how quickly we develop new medicines and materials.

Graduate-level questions across physics, chemistry, and biology. Not textbook stuff — multi-step reasoning problems that trip up actual PhD students. Domain experts score about 65% on questions outside their specialty. Gemini 3.1 Pro hits 94.3%, which is getting into uncomfortable territory for anyone who thinks “AI can't really understand science.”

Humanity's Last Exam

Hardest questions humans can write

Why it matters: This is the canary in the coal mine for when AI starts to approach human-level expertise at the highest levels. Right now, the best model gets it wrong more than half the time. When something cracks 80% on this test, we're in a fundamentally different world — one where AI can engage with the hardest open problems in mathematics and philosophy. That has implications for everything from theorem proving to policy analysis.

Researchers worldwide submitted their toughest questions — problems from the edges of mathematics, physics, and philosophy. The name is deliberately dramatic. The questions are deliberately brutal.

Even the best model (Gemini 3.1 Pro) only manages 44.4%. More than half the time, the smartest AI we have gets it wrong. That's the point — this benchmark is designed to stay hard for a while.

Code that ships

Writing a function is easy. Fixing a bug in a 50,000-line codebase is the actual job.

SWE-Bench Verified / Pro

Real GitHub bug fixes

Why it matters: Software has bugs. Fixing them eats an enormous share of developer time. A model that scores well here can look at a real codebase, understand the architecture, find what's broken, and write a fix that passes tests — without hand-holding. Every percentage point improvement translates directly to fewer hours your engineering team spends on maintenance. At 60%+, you're looking at a tool that can handle a meaningful chunk of your bug backlog autonomously.

Hand the model a real bug report from Django or Flask. It reads the codebase, finds the problem, writes a fix, and the fix has to pass the project's existing tests. No hints. The “Verified” variant uses human-validated issues; “Pro” throws harder multi-file problems at it.
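The pass/fail criterion is stricter than "the tests pass": the fix has to flip the tests that reproduced the bug without breaking anything that already worked. A minimal sketch of that check, using the fail-to-pass / pass-to-pass framing from the public SWE-Bench dataset (this is an illustrative simplification, not the official harness code):

```python
def issue_resolved(fail_to_pass, pass_to_pass, test_results):
    """An issue counts as resolved only if every test that reproduced
    the bug now passes AND no previously passing test regressed.
    `test_results` maps test id -> True/False after the patch."""
    bug_fixed = all(test_results.get(t, False) for t in fail_to_pass)
    nothing_broke = all(test_results.get(t, False) for t in pass_to_pass)
    return bug_fixed and nothing_broke

# A patch that fixes the bug but breaks an unrelated test scores zero:
print(issue_resolved(["test_bug"], ["test_other"],
                     {"test_bug": True, "test_other": False}))  # False
```

That regression clause is why scores stay low: a model can't just hack around the failing test, it has to produce a fix the rest of the suite tolerates.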

GPT-5.3-Codex leads at 56.8% on Pro. Most models struggle past 40%. If you want to know whether an AI can actually do software engineering — not just autocomplete — this is the test.

LiveCodeBench Pro

Competitive programming, live

Why it matters: Competitive programming tests algorithmic thinking under pressure — the kind of problem-solving that underlies database optimization, route planning, financial modeling, and any domain where you need an efficient solution, not just a correct one. Improvements here mean AI gets better at the hard computational problems behind the scenes of most software.

Fresh competitive programming problems published after training cutoffs, so models can't have seen them. Algorithmic, timed, unforgiving. Gemini 3.1 Pro's 2887 Elo puts it in competitive-programmer territory — the kind of score that would qualify for Codeforces rounds against humans.

Terminal-Bench

CLI and sysadmin tasks

Why it matters: DevOps and infrastructure work is repetitive, error-prone, and expensive. A model that's good at Terminal-Bench can configure servers, debug deployments, and chain shell commands — the kind of work that currently requires senior engineers or expensive consultants. This is the path to AI that doesn't just write code but actually deploys and operates it.

Multi-step terminal work: configuring servers, debugging deployments, chaining shell commands together. GPT-5.3-Codex handles 77.3% of tasks correctly.

HumanEval / MBPP

Basic code generation

Why it matters: It doesn't, anymore — at least not for comparing frontier models. Every serious model scores 95%+. But historically, this was the benchmark that proved AI could write code at all, and it's still a useful sanity check for smaller or open-source models.

The originals. “Write a function that reverses a string.” Groundbreaking in 2021, mostly a checkbox now.

SciCode

Scientific computing

Why it matters: Most scientific computing is done by researchers who are domain experts but not professional programmers. A model that's good at SciCode could become a force multiplier for every research lab — turning a physicist's description of a simulation into working code, correctly handling the math that makes or breaks the result.

Implementing physics simulations, numerical methods, data analysis from real research papers. You need both coding chops and domain knowledge — getting the math wrong means the code is useless even if it runs. The top score right now is Gemini 3.1 Pro at 59%, which tells you how hard this is.

Raw knowledge

MMLU / MMLU-Pro

57 subjects, multiple choice

Why it matters: When you ask an AI a question about history, law, medicine, or anything else outside of coding and math, MMLU is what predicts whether you'll get a solid answer. It's the broadest measure of whether a model is generally knowledgeable or has blind spots. For anyone using AI as a research assistant or general-purpose tool, this is your baseline confidence score.

Multiple-choice questions spanning everything from abstract algebra to world religions. MMLU-Pro bumps the answer choices from 4 to 10, which kills the guessing advantage and separates models more clearly at the top.

Frontier models cluster in the 85–92% range on standard MMLU — useful for catching laggards but not great for splitting hairs between leaders. MMLU-Pro (70–80% range) still has room to differentiate.
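The guessing-advantage point is just arithmetic. Under a simplified model where a model truly knows some fraction of the questions and guesses uniformly on the rest (an idealization — real models guess better than uniform), the observed score works out as:

```python
def observed_accuracy(known_frac, n_choices):
    """Score when a model knows `known_frac` of the questions outright
    and picks uniformly at random among `n_choices` on the rest."""
    return known_frac + (1 - known_frac) / n_choices

# A model that genuinely knows 80% of the material:
print(round(observed_accuracy(0.80, 4), 2))   # 0.85 -- 4 choices inflate the score
print(round(observed_accuracy(0.80, 10), 2))  # 0.82 -- 10 choices keep it honest
```

With four options the random-guess floor is 25%; with ten it drops to 10%, so MMLU-Pro scores sit closer to what a model actually knows.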

Doing real work

APEX-Agents

Autonomous professional tasks

Why it matters: This is where AI goes from “tool you use” to “worker you hire.” Every AI company is building agents — systems that don't just answer questions but actually do multi-step work: research a company, draft a report, email the summary. Every improvement here brings us closer to AI that can handle entire workflows end-to-end, which changes the economics of knowledge work more than any other capability on this page.

Multi-step professional tasks with real tools, real interfaces, and real judgment calls required. Gemini 3.1 Pro leads at 33.5%, which sounds low until you consider how many places a multi-step workflow can break down.
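The "sounds low" point is compounding probability. Assuming (unrealistically) that each step of a workflow succeeds independently, per-step reliability erodes fast over a long task — the numbers below are illustrative, not APEX-Agents' actual task structure:

```python
def workflow_success(per_step, n_steps):
    """Chance an n-step workflow finishes if each step independently
    succeeds with probability `per_step`."""
    return per_step ** n_steps

# Even 90%-reliable steps compound badly over a ten-step task:
print(round(workflow_success(0.90, 10), 3))  # 0.349
print(round(workflow_success(0.99, 10), 3))  # 0.904
```

A ten-step task at 90% per-step reliability completes only about a third of the time — roughly where today's leaders sit — which is why small per-step gains produce outsized jumps on agent benchmarks.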

This is the benchmark that will matter most going forward. Every agent will eventually be measured against something like this.

No single benchmark tells you everything. Models get optimized for specific tests. Leaders change monthly. The smartest thing you can do is look at several benchmarks together and notice which models keep showing up near the top across different skills.

Scores here reflect the landscape as of early 2026. For live rankings, check the WhoLeads.AI leaderboard.


Published by
WhoLeads.AI