
Gemini 3.1 Pro: Google's Reasoning Leap

Google just more than doubled its reasoning score on ARC-AGI-2. It leads most major benchmarks. And it still can't take #1. The top three models are separated by 5 Elo points. We've never seen anything like this.

The Numbers
1500 Elo, 5 points behind Claude Opus 4.6

Google shipped Gemini 3.1 Pro yesterday with the kind of framing that tells you everything about where the industry is right now: “evolution not revolution.” Translation: we made something genuinely impressive, but we know you're going to compare it to Claude, and we don't want the headline to be “still not #1.”

Too late. Let's talk about it.

The reasoning leap that wasn't supposed to happen yet

ARC-AGI-2 was designed to be hard for years. The benchmark tests novel reasoning — pattern recognition and abstraction that can't be solved by memorizing training data. When it launched, the best models were scraping by in the low 30s.

Gemini 3 Pro scored 31.1%. Respectable. Expected.

Gemini 3.1 Pro scores 77.1%. That's not incremental improvement. That's a different capability tier. Whatever Google did to the reasoning architecture between versions, it worked. The kind of jump that makes benchmark designers start thinking about ARC-AGI-3.

For context, this is the largest single-generation improvement on a major reasoning benchmark we've tracked. More than doubling your score on a test specifically designed to resist rapid improvement? That's the real headline, not the Elo ranking.

No model dominates everything anymore

Here's what makes February 2026 different from any previous moment in AI: the leaderboard has fragmented. There is no single best model. There are three best models, each winning different games.

Who Leads What
Gemini 3.1 Pro leads:
ARC-AGI-2: 77.1%
Humanity's Last Exam + tools: 53.1%
APEX-Agents: 33.5%

Claude Opus 4.6 leads:
LMArena Text (overall): 1505 Elo
GDPval-AA (expert tasks): 1606

GPT-5.3-Codex leads:
Arena Coding: #1
This is genuinely new territory. A year ago, whoever led LMArena basically led everything. Now? Gemini dominates reasoning and science benchmarks. Claude wins at expert-level tasks and human preference. GPT-5.3-Codex owns real-world coding. The “best AI model” question no longer has a single answer.

The 5-point gap: tightest race in LMArena history

Let's put the LMArena numbers in perspective.

LMArena Text Leaderboard — Top 3
#1 Claude Opus 4.6: 1505
#2 Claude Opus 4.6 Thinking: 1504
#3 Gemini 3.1 Pro: 1500

Five Elo points from #1 to #3. In statistical terms, this is barely distinguishable from noise. At this separation, the model that “wins” a head-to-head depends as much on the specific prompt as on the model's actual capability. We're in coin-flip territory.
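
To put “coin-flip territory” in numbers: under the Elo model, a rating gap converts directly into an expected head-to-head win rate. A quick Python sketch (LMArena actually fits a Bradley-Terry model, but the Elo approximation is close enough to make the point):

def elo_win_probability(rating_a: float, rating_b: float) -> float:
    """Expected probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

# The current #1 vs. #3 gap on LMArena Text: 1505 vs. 1500.
print(f"{elo_win_probability(1505, 1500):.1%}")  # ~50.7%

# For comparison, a 40-point lead (the GPT-4o-era gap):
print(f"{elo_win_probability(1540, 1500):.1%}")  # ~55.7%

Five points buys the leader a roughly 50.7% expected win rate per matchup, which is why the ranking is so sensitive to the prompt mix.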

Compare this to a year ago, when GPT-4o led by 40+ points. Or six months ago, when Gemini 3 Pro had a comfortable 10-point cushion. The convergence is accelerating. Every major lab is approaching the same capability ceiling from different angles — and reaching it at the same time.

The practical implication? For most use cases, the difference between these three models is negligible. The choice comes down to pricing, API ergonomics, and which specific benchmarks matter for your workflow. The era of one model clearly outclassing the field is over.

What's actually new in 3.1 Pro

Beyond the benchmarks, a few capabilities stand out:

Animated SVG generation. Give it a text prompt, get back a working animated SVG. This sounds like a party trick until you realize how many product teams spend days hand-coding SVG animations for landing pages. Early examples circulating on Twitter show surprisingly polished results — spinning logos, morphing shapes, animated data visualizations. A genuine new modality.
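
Google hasn't published the exact recipe behind these demos, but the plumbing is simple: prompt the model, pull the <svg> markup out of the response, and save it to a file. A rough sketch using the google-genai Python SDK; the model id "gemini-3.1-pro-preview" is a placeholder guess, since the official preview identifier may differ:

from google import genai

client = genai.Client()  # reads GEMINI_API_KEY from the environment

prompt = (
    "Generate a self-contained animated SVG of a logo that slowly spins "
    "and pulses its color. Return only the <svg>...</svg> markup."
)

response = client.models.generate_content(
    model="gemini-3.1-pro-preview",  # placeholder id; check the current model list
    contents=prompt,
)

# Extract the markup in case the model wraps it in a code fence or adds commentary.
text = response.text
start, end = text.find("<svg"), text.rfind("</svg>")
svg = text[start : end + len("</svg>")] if start != -1 and end != -1 else text

with open("logo.svg", "w", encoding="utf-8") as f:
    f.write(svg)
print("Wrote logo.svg; open it in a browser to see the animation.")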

Complex system synthesis. Google demoed 3.1 Pro building a live aerospace dashboard that configures a public telemetry stream to visualize the ISS orbit in real-time. The model didn't just write code — it reasoned through API configuration, data pipeline setup, and UI design in a single pass. This is the kind of multi-step system thinking that separates "can write a function" from "can architect a feature."
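
Google hasn't said which telemetry source the demo wired up, but free ISS position feeds are easy to come by. The sketch below polls Open Notify (a commonly used public endpoint, not necessarily the one in the demo) to show the shape of the data pipeline such a dashboard sits on:

import json
import time
import urllib.request

# Open Notify is a free, unauthenticated ISS position feed.
ISS_NOW = "http://api.open-notify.org/iss-now.json"

def iss_position() -> tuple[float, float]:
    """Return the ISS's current (latitude, longitude) in degrees."""
    with urllib.request.urlopen(ISS_NOW, timeout=10) as resp:
        data = json.load(resp)
    pos = data["iss_position"]
    return float(pos["latitude"]), float(pos["longitude"])

if __name__ == "__main__":
    for _ in range(3):  # a real dashboard would stream this indefinitely
        lat, lon = iss_position()
        print(f"ISS at {lat:+.2f}, {lon:+.2f}")
        time.sleep(5)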

Interactive 3D generation. In another demo, 3.1 Pro coded a 3D starling murmuration with hand-tracking controls and a generative audio score that shifts based on the flock's movement. For researchers and designers prototyping sensory-rich interfaces, this kind of creative-technical crossover is genuinely new territory.

Still in preview. Gemini 3.1 Pro has not reached general availability yet, which means the numbers could shift. Google has a history of squeezing additional performance out of a model between preview and GA. If they find even 5-10 more Elo points, this story changes completely.

The agentic future Google is betting on

The APEX-Agents score deserves its own section. At 33.5%, Gemini 3.1 Pro leads the benchmark that tests autonomous multi-step task completion — the kind of work that matters when you're building AI systems that actually do things instead of just answering questions.

Google is clearly optimizing for this. The simultaneous launch of Antigravity, their agentic AI platform, isn't a coincidence. The thesis: the next phase of AI value creation isn't chatbots, it's agents. Models that can navigate UIs, call APIs, handle multi-step workflows, and recover from errors without human intervention.
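
In API terms, “agentic” mostly means a loop: the model requests a tool call, the harness executes it, and the result is fed back into context until the task is done. A minimal sketch using the google-genai SDK's automatic function calling; the check_flight_status tool and the "gemini-3.1-pro-preview" model id are invented for illustration, and real agent platforms layer planning, error recovery, and guardrails on top:

from google import genai
from google.genai import types

client = genai.Client()  # reads GEMINI_API_KEY from the environment

def check_flight_status(flight_number: str) -> dict:
    """Look up a flight's status (hypothetical tool; a real agent would call an airline API)."""
    return {"flight": flight_number, "status": "delayed", "new_departure": "18:45"}

# Passing a plain Python function as a tool enables automatic function calling:
# the SDK runs the request-execute-respond loop, so the model decides when to
# call check_flight_status and sees its result before answering.
response = client.models.generate_content(
    model="gemini-3.1-pro-preview",  # placeholder id; check the current model list
    contents="Is flight UA123 on time? If not, draft a short note to the traveler.",
    config=types.GenerateContentConfig(tools=[check_flight_status]),
)
print(response.text)

Benchmarks like APEX-Agents stress exactly what this toy loop skips: chaining many such calls, recovering when one fails, and knowing when to stop.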

If that thesis is right — and the money flowing into agent infrastructure suggests it is — then APEX-Agents might matter more than LMArena Elo within a year. Google is positioning Gemini as the model you build agents on, even if it's not the model that wins the chatbot popularity contest.

How long until someone takes #1?

Claude Opus 4.6 has held #1 for 12 days. The average tenure at #1 over the past year is about 40 days. But with a 5-point gap, “holding #1” is almost a statistical artifact at this point.

The question isn't whether someone takes the crown. It's whether anyone can open a meaningful gap again. Three scenarios:

Gemini 3.1 Pro GA pushes past Claude. Google has room to optimize between preview and release. Even modest gains put them at #1. Probability: moderate.

OpenAI drops GPT-5.5. They've been uncharacteristically quiet since GPT-5.3-Codex. A flagship text model update could leapfrog both. Probability: hard to call; it hinges entirely on timing.

The gap stays small. We've entered a regime where all frontier models converge to similar capability. #1 rotates every few weeks based on minor updates. This is the most likely outcome — and maybe the most interesting one.

Google called Gemini 3.1 Pro “evolution not revolution.” They're right about the framing, wrong about the implication. The evolution is the story. When three labs can each ship a frontier model within weeks of each other, all within coin-flip distance of #1, that is the revolution. Competition this tight drives progress faster than any single breakthrough.

We'll be watching the GA release closely.

Published by WhoLeads.AI