Chatbot Arena Leaderboard: Compare Every Major AI Model in One Place

In 2026 there are too many AI models to track by word of mouth. New releases land weekly, every lab ships multiple tiers, and benchmark scores are published selectively. If you need to choose a model for a project, a product, or research, you need a leaderboard, but not all leaderboards measure the same thing.

Here is every major model leaderboard worth using, what it actually measures, and when to trust it.

OpenLM Chatbot Arena+

chatbot-arena-plus.openlm.ai

Chatbot Arena+ is the most useful single page for model selection in 2026 because it aggregates multiple benchmarks into one sortable table: Arena Elo (crowdsourced human preference from 6M+ blind votes), AAII (Artificial Analysis Intelligence Index), ARC-AGI (reasoning), Coding Elo, Vision Elo, and MMLU-Pro. Every model gets a row with all scores visible at a glance.

The table includes both proprietary and open-weight models, shows license type, and updates regularly. It is maintained by OpenLM.ai and pulls from the official dataset on Hugging Face. As of May 2026, the top of the table shows GPT-5.5-high at 1506 Elo and Gemini-2.5-Pro holding at 1460, with open-weight models such as GLM-5.1 and DeepSeek-V3.2 competing in the 1420-1470 range.

Use this when: you want the fastest, most complete snapshot of the current frontier across every major metric in one table.

Caveat: The table blends different methodologies (human voting vs automated benchmarks), so any ranking read across its columns is informative rather than statistically rigorous.

Arena AI (formerly LMSYS Chatbot Arena)

arena.ai

This is the source. Arena AI runs the actual blind side-by-side voting platform where users compare two anonymous models and pick the better response. The Elo ratings here are the raw, unaggregated data that other sites pull in.

The platform now tracks 10 arenas (Text, Vision, Code, Document, Search, Text-to-Image, Image-Edit, Text-to-Video, Image-to-Video, and Video-Edit), each with its own leaderboard and style-controlled variants. The full historical dataset is available on Hugging Face as lmarena-ai/leaderboard-dataset.

Arena AI also ships Arena Expert (filtered to expert-level prompts) and Arena Max (a model router that picks the best model for each prompt based on 5M+ votes). The March 2026 update added price-per-token and context window columns to every leaderboard, making it possible to compare models on cost and performance simultaneously.

Use this when: you want the purest signal (human preference, blind, crowdsourced) and are willing to look at one modality at a time.

Caveat: Elo is relative to the current model pool. When stronger models enter, existing scores drift down even if the model hasn’t changed.
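
To make the rating scale concrete, here is a minimal sketch of the classic Elo formulas. This is an assumption for illustration: the live leaderboard may fit its ratings differently (for example with a Bradley-Terry style model), and the `k` value below is arbitrary. The function names are hypothetical.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the classic Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update(rating_a: float, rating_b: float, score_a: float, k: float = 4.0):
    """Return new (rating_a, rating_b) after one vote.

    score_a is 1.0 if A wins, 0.5 for a tie, 0.0 if A loses.
    k is an illustrative step size, not a value taken from any leaderboard.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    # The update is zero-sum: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta


# How often would a 1506-rated model be preferred over a 1460-rated one?
print(f"expected win rate: {expected_score(1506, 1460):.3f}")

# An unchanged incumbent still loses points when a newcomer beats it.
incumbent, newcomer = update(1460, 1506, score_a=0.0)
print(f"incumbent after one loss: {incumbent:.2f}")
```

Under this model a gap of about 46 points works out to a win rate of roughly 57%, which is a modest edge rather than a blowout. Because the update is zero-sum, a strong newcomer pulls points out of the existing pool, which is the drift described in the caveat.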

Artificial Analysis

artificialanalysis.ai

Artificial Analysis measures models on four independent axes: intelligence (their own AA Intelligence Index v4.0, a composite of 10 evaluations including GPQA Diamond, Humanity’s Last Exam, and GDPval-AA), speed (tokens per second), price (USD per million tokens), and latency (time to first token). Every model gets a score on all four, so you can find the best model under a price ceiling or within a latency budget.
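
Picking "the best model under a price ceiling or within a latency budget" is a filter followed by a maximum. The sketch below shows the idea on placeholder rows. The model names, numbers, and the `ModelStats` type are all invented for illustration and are not taken from Artificial Analysis.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModelStats:
    name: str
    intelligence: float       # composite index, higher is better
    tokens_per_second: float  # output speed
    usd_per_million: float    # price per million tokens
    ttft_seconds: float       # latency: time to first token


# Placeholder rows, not real measurements.
CATALOG = [
    ModelStats("model-a", 60, 80, 12.0, 1.8),
    ModelStats("model-b", 52, 150, 3.0, 0.6),
    ModelStats("model-c", 45, 600, 0.8, 0.3),
]


def best_within_budget(models, max_price: float, max_ttft: float) -> Optional[ModelStats]:
    """Highest-intelligence model that fits both the price and latency limits."""
    eligible = [
        m for m in models
        if m.usd_per_million <= max_price and m.ttft_seconds <= max_ttft
    ]
    return max(eligible, key=lambda m: m.intelligence, default=None)


choice = best_within_budget(CATALOG, max_price=5.0, max_ttft=1.0)
print(choice.name if choice else "nothing fits the budget")
```

With these placeholder numbers the top-scoring model is excluded by both limits, so the selection falls to the mid-tier row.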

They also run a Coding Agent Index (composite pass@1 on SWE-Bench-Pro-Hard-AA, Terminal-Bench v2, and SWE-Atlas-QnA) and maintain separate leaderboards for image generation and video generation models. Their scatter plots (intelligence vs price, intelligence vs speed) are the clearest way to see where a model sits relative to its peers.
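
What the intelligence-vs-price scatter plot shows visually is a Pareto frontier: the models that no other model beats on both axes at once. A small sketch of that check follows, with invented placeholder data and a hypothetical `pareto_frontier` helper.

```python
def pareto_frontier(models):
    """Return names of models that are not dominated.

    models: list of (name, intelligence, usd_per_million) tuples.
    A model is dominated if another one is at least as smart and at least
    as cheap, and strictly better on one of the two.
    """
    frontier = []
    for name, iq, price in models:
        dominated = any(
            o_iq >= iq and o_price <= price and (o_iq > iq or o_price < price)
            for _, o_iq, o_price in models
        )
        if not dominated:
            frontier.append(name)
    return frontier


# Placeholder rows, not real measurements.
SAMPLE = [
    ("model-a", 60, 12.0),
    ("model-b", 52, 3.0),
    ("model-c", 45, 0.8),
    ("model-d", 50, 6.0),  # costs more than model-b and scores lower
]

print(pareto_frontier(SAMPLE))
```

Anything off the frontier, like `model-d` here, only makes sense if it wins on an axis the plot does not show, such as speed or context length.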

As of May 2026, GPT-5.5 (xhigh) leads the Intelligence Index at 60, followed by Claude Opus 4.7 (Adaptive Reasoning, Max Effort) at 57 and Gemini 3.1 Pro Preview at 57. Mercury 2 is the fastest model at 666 tokens/s.

Use this when: you need to make a build-vs-buy decision and care about cost or latency alongside quality.

Caveat: The Intelligence Index is an automated composite, not human preference. A model that scores well on benchmarks may not feel better in chat.

LLM Stats

llm-stats.com

LLM Stats combines GPQA Diamond, SWE-Bench Verified, coding-arena performance, and pricing into a single LLM Stats Score. The catalog tracks 298 canonical models across every major lab, with separate leaderboards for open-weight models, coding, math, writing, reasoning, image generation, video, tool calling, long context, and computer use.

The site is the most comprehensive in terms of model count and category coverage. Pricing revalidates hourly, live performance metrics use a 7-day rolling average, and the comparison tool lets you select up to four models to view side by side.
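
A 7-day rolling average smooths out one-off spikes in daily measurements. The sketch below is a generic trailing mean, not LLM Stats' actual pipeline, and the sample throughput values are invented.

```python
from collections import deque


def rolling_mean(values, window: int = 7):
    """Trailing mean over the most recent `window` values (fewer at the start)."""
    buf = deque(maxlen=window)
    total = 0.0
    out = []
    for v in values:
        if len(buf) == window:
            total -= buf[0]  # this value is about to be evicted by append()
        buf.append(v)
        total += v
        out.append(total / len(buf))
    return out


# Invented daily tokens-per-second readings with one outage on day 4.
daily_tps = [100, 102, 98, 20, 101, 99, 100, 103]
for day, avg in enumerate(rolling_mean(daily_tps), start=1):
    print(f"day {day}: {avg:.1f}")
```

The trade-off is lag: a real regression takes several days to show up fully in the averaged number.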

As of May 2026, Claude Mythos Preview leads on GPQA Diamond (94.6%), Gemini 3.1 Pro leads on coding arena score, and Kimi K2.6 is the cheapest model in the top 10 at $0.95 per million tokens.

Use this when: you want to browse the full catalog of roughly 300 models, compare by specific benchmark scores, or need specialized leaderboards (tool calling, long context, computer use).

Caveat: The composite score is a proprietary blend; the weighting methodology matters if you are making a high-stakes decision. Read their methodology page before relying on the aggregate number.
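
To see why the weighting matters, here is a toy weighted composite where the same two models swap places depending on the weights. The component names, scores, and weights are all invented, and this is not the LLM Stats formula.

```python
def composite(scores: dict, weights: dict) -> float:
    """Weighted average of component scores, normalized by total weight."""
    total_weight = sum(weights.values())
    return sum(scores[k] * w for k, w in weights.items()) / total_weight


def rank(models: dict, weights: dict) -> list:
    """Model names sorted from best to worst composite score."""
    return sorted(models, key=lambda name: composite(models[name], weights), reverse=True)


# Invented component scores on a 0-100 scale.
MODELS = {
    "model-a": {"reasoning": 90, "coding": 70, "value": 40},
    "model-b": {"reasoning": 75, "coding": 80, "value": 85},
}

quality_heavy = {"reasoning": 0.6, "coding": 0.3, "value": 0.1}
price_heavy = {"reasoning": 0.2, "coding": 0.3, "value": 0.5}

print("quality-heavy:", rank(MODELS, quality_heavy))
print("price-heavy:  ", rank(MODELS, price_heavy))
```

Neither ordering is wrong. They answer different questions, which is why an aggregate number is only as useful as its published weights.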

ARC-AGI (Abstraction and Reasoning Corpus)

arcprize.org

ARC-AGI is not a general model leaderboard. It tests a specific capability: fluid intelligence, the ability to solve novel problems from a few examples, without relying on memorized training data. The benchmark uses grid-based puzzles that require pattern recognition, spatial reasoning, and rule inference.
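
As a rough illustration of "rule inference from a few examples", the sketch below encodes a toy task in the train/test layout used by the public ARC task files (grids as lists of integers) and checks which candidate rule explains every training pair. The task and the two candidate rules are invented and far easier than real ARC puzzles.

```python
def flip_horizontal(grid):
    """Mirror each row left to right."""
    return [list(reversed(row)) for row in grid]


def transpose(grid):
    """Swap rows and columns."""
    return [list(col) for col in zip(*grid)]


# A tiny hypothesis space; real solvers search a vastly larger one.
CANDIDATE_RULES = {"flip_horizontal": flip_horizontal, "transpose": transpose}

# Invented toy task: integers stand for cell colors.
TASK = {
    "train": [
        {"input": [[1, 0], [2, 0]], "output": [[0, 1], [0, 2]]},
        {"input": [[0, 3, 3], [0, 0, 4]], "output": [[3, 3, 0], [4, 0, 0]]},
    ],
    "test": [{"input": [[5, 0, 0], [0, 6, 0]]}],
}


def infer_rule(task):
    """Return the name of the first rule consistent with all training pairs."""
    for name, rule in CANDIDATE_RULES.items():
        if all(rule(pair["input"]) == pair["output"] for pair in task["train"]):
            return name
    return None


rule_name = infer_rule(TASK)
print("inferred rule:", rule_name)
if rule_name is not None:
    print("prediction:", CANDIDATE_RULES[rule_name](TASK["test"][0]["input"]))
```

The hard part of the benchmark is that the rule is different for every task, so there is no fixed list of candidates to memorize.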

ARC-AGI v2 was released in late 2025 with harder puzzles, and the leaderboard shows a wide spread. Most frontier models score in the 1-20% range on v2, while the best (Claude Opus 4.6 thinking, GPT-5.5-high) reach 85%. This is one of the few benchmarks where reasoning models outperform their non-reasoning counterparts by a wide margin.

The ARC Prize ($1M+ in awards) has driven significant progress, from near-zero scores in 2024 to frontier models breaking 85% on v2 by early 2026.

Use this when: you want to measure genuine reasoning capability independent of training data contamination, or you care specifically about few-shot adaptation.

Caveat: ARC-AGI is narrow. A high ARC score does not guarantee good chat, good coding, or good agent performance. It measures one specific type of reasoning.

Agent Arena & Berkeley Function Calling Leaderboard

gorilla.cs.berkeley.edu

The Berkeley team runs two complementary leaderboards for agentic AI:

Agent Arena is a live sandbox where users vote on agent performance across tasks (coding, research, data analysis) with different LLM backends, frameworks (LangChain, LlamaIndex, CrewAI), and tools. The leaderboard ranks agents, models, frameworks, and tools separately, using Elo ratings from head-to-head comparisons.

BFCL v4 (Berkeley Function Calling Leaderboard) evaluates models on tool use: simple and parallel function calling, multi-turn interactions, web search, memory management, and format sensitivity. The v4 weighting shifted 40% of the score to agentic tasks (web search + memory), reflecting the industry’s move from single-turn function calls to autonomous agents. Qwen3.5-397B-A17B leads at 72.9% overall accuracy.

Use this when: you are building agentic systems and need to know which models handle tool calling, web search, and multi-step tasks reliably.

Caveat: Function calling leaderboards measure the model’s ability to produce valid JSON function calls in isolation, not its performance in a full agent loop with error recovery.
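
For a sense of what "a valid JSON function call" means in practice, here is a minimal structural check. The tool spec format, the `get_weather` tool, and the `validate_call` helper are hypothetical and are not BFCL's grading code.

```python
import json

# Hypothetical tool definition: argument name -> expected Python type.
TOOL_SPEC = {
    "name": "get_weather",
    "required": {"city": str},
    "optional": {"unit": str},
}


def validate_call(raw: str, spec: dict) -> list:
    """Return a list of problems; an empty list means the call is well-formed."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(call, dict):
        return ["call must be a JSON object"]

    problems = []
    if call.get("name") != spec["name"]:
        problems.append(f"unknown function: {call.get('name')}")

    args = call.get("arguments", {})
    if not isinstance(args, dict):
        return problems + ["arguments must be a JSON object"]

    allowed = {**spec["required"], **spec["optional"]}
    for key in spec["required"]:
        if key not in args:
            problems.append(f"missing required argument: {key}")
    for key, value in args.items():
        if key not in allowed:
            problems.append(f"unexpected argument: {key}")
        elif not isinstance(value, allowed[key]):
            problems.append(f"wrong type for argument: {key}")
    return problems


print(validate_call('{"name": "get_weather", "arguments": {"city": "Oslo"}}', TOOL_SPEC))
print(validate_call('{"name": "get_weather", "arguments": {"unit": 5}}', TOOL_SPEC))
```

A call can pass every check here and still be the wrong call for the task, or be followed by no recovery when the tool returns an error, which is the gap the caveat points at.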

How to Read Them Together

No single leaderboard tells the full story. Each measures a different signal:

What you care about | Use
General chat quality | Arena AI Elo (human preference)
Fastest overview | OpenLM Chatbot Arena+ (one table)
Price vs performance | Artificial Analysis (intelligence x cost x speed)
Deep model catalog | LLM Stats (300 models, 20+ categories)
Pure reasoning | ARC-AGI (fluid intelligence)
Tool calling / agents | BFCL v4 + Agent Arena

If you are choosing a model for production, check at least three: Arena AI for human preference, Artificial Analysis for cost and speed, and either BFCL or ARC-AGI depending on whether you need tool use or reasoning. If the model you are considering ranks high on all three, it is probably the right choice.

Crepi il lupo! 🐺