LLM Leaderboard 2026 — Compare AI Models by Benchmarks, Speed & Price

COMPARE · BENCHMARK · RANK

Track and compare the latest benchmark performance of 40+ frontier AI models. Data is sourced from model providers, Artificial Analysis, BenchLM, and independently run evaluations.

LLM Leaderboard delivers clear, up-to-date rankings for reasoning, math, coding, vision, and multilingual performance while also showing speed and cost metrics — covering models from Anthropic, OpenAI, Google, Meta, xAI, DeepSeek, Alibaba, Mistral, Cohere, and more.

Best reasoning: Claude Mythos Preview, 94.6% GPQA
Best coding: Claude Mythos Preview, 93.9% SWE-Bench
Fastest: Llama 4 Scout, 2,600 t/s
Cheapest: Nova Micro, $0.04 / $0.14
India / multilingual: Sarvam 105B, open weights

AI benchmark rankings by task

🧠 Reasoning · GPQA Diamond

Claude Mythos Preview — 94.6%
GPT-5.4 Pro — 94.5%
Gemini 3.1 Pro — 94.3%
Claude Opus 4.7 — 94.2%
GPT-5.5 — 93.6%

📐 Math · AIME 2025

💻 Agentic Coding · SWE-Bench

🌐 General · Humanity's Last Exam

👁 Visual Reasoning · ARC-AGI 2

🌏 Multilingual · MMMLU

Historical progress

Tracking the progression of state-of-the-art models on the GPQA Diamond benchmark from 2023 to 2026.

Speed & affordability

⚡ Fastest models (tokens/sec)

💰 Cheapest (per 1M tokens)

Full LLM leaderboard — all AI models

Model	Provider	Country	Context	Cutoff	I/O Cost (per 1M tokens)	GPQA	SWE-Bench	Speed
Claude Mythos Preview	Anthropic	🇺🇸 USA	1M	Apr 2026	Limited	94.6%	93.9%	—
GPT-5.4 Pro	OpenAI	🇺🇸 USA	1M	Mar 2026	$30 / $180	94.5%	80.2%	—
Claude Opus 4.8	Anthropic	🇺🇸 USA	1M	Jun 2026	$6 / $30	94.4%	93.7%	—
Gemini 3.1 Pro	Google	🇺🇸 USA	1M	Feb 2026	$2 / $12	94.3%	80.6%	—
Claude Opus 4.7	Anthropic	🇺🇸 USA	200K	May 2025	$5 / $25	94.2%	82%	67 t/s
GPT-5.5 Pro	OpenAI	🇺🇸 USA	1M	Apr 2026	$30 / $180	94.2%	81%	—
MAI-Thinking-1	Microsoft	🇺🇸 USA	256K	Jun 2026	Private preview	93.8%	80.8%	—
GPT-5.5	OpenAI	🇺🇸 USA	1M	Apr 2026	$5 / $30	93.6%	78.6%	—
GPT-5.4	OpenAI	🇺🇸 USA	1M	Mar 2026	$5 / $30	92.8%	77.4%	—
Gemini 3 Pro	Google	🇺🇸 USA	2M	Mar 2026	$3.5 / $10.5	92.1%	76.3%	—
Claude Opus 4.6	Anthropic	🇺🇸 USA	1M	May 2025	$5 / $25	91.2%	80.8%	67 t/s
Kimi K2.6	Moonshot	🇨🇳 China	256K	Apr 2026	$0.75 / $3.50	91.1%	80.2%	—
DeepSeek R2	DeepSeek	🇨🇳 China	128K	Feb 2026	$0.55 / $2.19	89.3%	72.4%	—
Claude Sonnet 4.6	Anthropic	🇺🇸 USA	1M	Aug 2025	$3 / $15	88.5%	74.2%	55 t/s
Grok 4.3	xAI	🇺🇸 USA	256K	Apr 2026	$1.25 / $2.50	88%	74.5%	203 t/s
Qwen 3.6 Plus	Alibaba	🇨🇳 China	128K	Apr 2026	$1.50 / $4.50	87.4%	78.8%	—
DeepSeek V4 Pro	DeepSeek	🇨🇳 China	1M	Mar 2026	$0.30 / $0.50	87.1%	81%	—
Gemini 3.1 Flash-Lite	Google	🇺🇸 USA	1M	Jan 2025	$0.25 / $1.50	86.9%	62.8%	363 t/s
Kimi K2 Thinking	Moonshot	🇨🇳 China	128K	Jan 2026	$2 / $6	86.7%	69.1%	—
GPT-5.5 Instant	OpenAI	🇺🇸 USA	400K	Aug 2025	$5 / $30	85.6%	—	145 t/s
Grok 4	xAI	🇺🇸 USA	256K	Feb 2026	$3 / $15	85.2%	66.4%	—
Claude Opus 4.5	Anthropic	🇺🇸 USA	200K	Feb 2025	$15 / $75	84.8%	70.3%	45 t/s
Qwen 3.5 397B	Alibaba	🇨🇳 China	128K	Feb 2026	Open	84.2%	72.1%	—
GLM-5	Zhipu AI	🇨🇳 China	128K	Feb 2026	$1.00 / $3.20	83.5%	77.8%	—
GPT-4.1	OpenAI	🇺🇸 USA	1M	Jun 2025	$2 / $8	82.4%	68.9%	—
Gemini 2.5 Flash	Google	🇺🇸 USA	1M	Jan 2026	$0.15 / $0.6	80.3%	61.4%	780 t/s
Llama 4 Maverick	Meta	🇺🇸 USA	1M	Dec 2025	Open	80.1%	62.3%	—
DeepSeek V4 Flash	DeepSeek	🇨🇳 China	1M	Mar 2026	$0.08 / $0.28	79.4%	68.2%	—
MiniMax M2	MiniMax	🇨🇳 China	128K	Nov 2025	$0.30 / $1.20	78.2%	68.5%	—
GPT-5.4 Mini	OpenAI	🇺🇸 USA	1M	Mar 2026	$0.75 / $3	78.1%	58.4%	—
Qwen 3 235B	Alibaba	🇨🇳 China	128K	Apr 2025	Open	76.8%	60.5%	—
Llama 4 Scout	Meta	🇺🇸 USA	10M	Dec 2025	Open	76.5%	55.8%	2,600 t/s
Mistral Medium 3.5	Mistral	🇫🇷 France	256K	Apr 2026	$1.50 / $7.50	74.8%	77.6%	—
Mistral Large 3	Mistral	🇫🇷 France	256K	Oct 2025	$0.50 / $1.50	74.3%	58.1%	—
Gemini 2.0 Flash	Google	🇺🇸 USA	1M	Sep 2025	$0.1 / $0.4	74.1%	53.2%	520 t/s
Grok 4.1 Fast	xAI	🇺🇸 USA	128K	Mar 2026	$0.20 / $0.50	72.8%	51.2%	350 t/s
Cohere Command A	Cohere	🇨🇦 Canada	128K	Mar 2025	$2.50 / $10	72.4%	55.8%	—
Claude Haiku 4.5	Anthropic	🇺🇸 USA	200K	Aug 2025	$1 / $5	72.1%	58.3%	120 t/s
DeepSeek R1	DeepSeek	🇨🇳 China	128K	Dec 2024	$0.55 / $2.19	71.5%	49.2%	—
GPT-4.1 mini	OpenAI	🇺🇸 USA	1M	Jun 2025	$0.4 / $1.6	71.2%	52.1%	—
Mistral Small 4	Mistral	🇫🇷 France	256K	Feb 2026	$0.15 / $0.60	68.9%	49.2%	137 t/s
GPT-5.4 Nano	OpenAI	🇺🇸 USA	512K	Mar 2026	$0.20 / $1.25	68.5%	42.1%	—
Sarvam 105B	Sarvam AI	🇮🇳 India	128K	Jun 2025	Open	66.5%	—	—
Devstral Medium	Mistral	🇫🇷 France	128K	Jun 2025	$0.40 / $2.00	66.1%	62.4%	—
Gemma 3 27B	Google	🇺🇸 USA	131K	Mar 2025	$0.07 / $0.07	42.6%	11.4%	—
Nova Micro	Amazon	🇺🇸 USA	128K	Dec 2024	$0.04 / $0.14	—	—	—
Sarvam 30B	Sarvam AI	🇮🇳 India	32K	Jun 2025	Open	—	—	—
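
For readers who want to re-rank the table themselves, the following is a minimal sketch of parsing tab-separated rows and sorting by one benchmark column. It uses only four rows and four columns copied from the table above; the names `parse_pct` and `rank_by` are illustrative assumptions, not part of any leaderboard API, and cells holding a dash are treated as missing scores.

```python
import csv
import io

MISSING = "\u2014"  # the dash the table uses for "no score reported"

# A small subset of the table above (assumption: tab-separated text).
ROWS = (
    "Model\tProvider\tGPQA\tSWE-Bench\n"
    "Claude Mythos Preview\tAnthropic\t94.6%\t93.9%\n"
    "GPT-5.4 Pro\tOpenAI\t94.5%\t80.2%\n"
    "Gemini 3.1 Pro\tGoogle\t94.3%\t80.6%\n"
    f"Sarvam 105B\tSarvam AI\t66.5%\t{MISSING}\n"
)


def parse_pct(cell):
    """Convert a cell like '94.6%' to a float; return None for missing cells."""
    cell = cell.strip()
    if cell in ("", MISSING):
        return None
    return float(cell.rstrip("%"))


def rank_by(rows, column):
    """Return (model, score) pairs sorted by score, highest first.

    Models with no score in the chosen column are skipped.
    """
    scored = [(row["Model"], parse_pct(row[column])) for row in rows]
    scored = [(model, score) for model, score in scored if score is not None]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


rows = list(csv.DictReader(io.StringIO(ROWS), delimiter="\t"))
ranking = rank_by(rows, "SWE-Bench")
for model, score in ranking:
    print(f"{model}: {score}%")
```

Skipping missing scores instead of treating them as zero keeps models without a reported result from appearing at the bottom of a ranking they never took part in.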

Model selection guides

Focused rankings for common searches — each guide uses the same data as this leaderboard.

Best LLM for coding: SWE-Bench rankings
Cheapest AI models: API cost per 1M tokens
GPT vs Claude vs Gemini: Frontier comparison
Best open-source LLM: Open-weights models
Best Indian LLMs in 2026: India-focused and multilingual models
Best Multilingual LLM in 2026: Top models for cross-language and regional language support
Largest context window: Long-document LLMs

Frequently asked questions

What is an AI leaderboard?

An AI leaderboard is a ranking system that compares large language models (LLMs) across standardized benchmarks. LLM Leaderboard ranks 40+ models from providers like OpenAI, Anthropic, Google, xAI, DeepSeek, Alibaba, and Sarvam AI using benchmarks such as GPQA Diamond, SWE-Bench, AIME 2025, and Humanity's Last Exam.

Which is the best AI model in 2026?

There is no single best AI model; it depends on your use case. For coding, Claude Mythos Preview (93.9%) and Claude Opus 4.8 (93.7%) lead SWE-Bench in the full leaderboard table. For reasoning, Claude Mythos Preview (94.6%) and GPT-5.4 Pro (94.5%) top GPQA Diamond. For cost efficiency, DeepSeek V4 Flash and Gemini 2.0 Flash offer the best value. For Indian language support, Sarvam AI models are purpose-built for multilingual performance.

GPT-5.5 vs Claude Opus 4.7 vs Gemini 3.1 Pro — which should I use?

Choose Claude Opus 4.7 for complex coding and instruction-following. Choose GPT-5.5 for general-purpose reasoning, agentic workflows, and ecosystem integration. Choose Gemini 3.1 Pro for massive context windows (1M+ tokens), multimodal tasks, and cost-effective research. Many teams use a routing strategy — cheaper models for simple tasks, frontier models for complex ones.
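
The routing strategy mentioned above can be sketched roughly as follows. This is a minimal illustration, not a recommended production router: the word-count threshold, the `needs_code` flag, and the choice of DeepSeek V4 Flash and Claude Opus 4.7 as the two tiers are all assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical tiers; swap in whichever models fit your budget and workload.
CHEAP_MODEL = "DeepSeek V4 Flash"
FRONTIER_MODEL = "Claude Opus 4.7"


@dataclass(frozen=True)
class Route:
    model: str
    reason: str


def route(prompt: str, needs_code: bool = False, max_simple_words: int = 200) -> Route:
    """Pick a model tier with a crude heuristic.

    Short prompts that do not involve coding go to the cheap model;
    everything else goes to the frontier model.
    """
    is_simple = len(prompt.split()) <= max_simple_words and not needs_code
    if is_simple:
        return Route(CHEAP_MODEL, "short, non-coding prompt")
    return Route(FRONTIER_MODEL, "long or coding-heavy prompt")


print(route("Summarize this sentence in five words."))
print(route("Refactor this module to remove the global state.", needs_code=True))
```

Real routers usually rely on a classifier or on the cheap model's own confidence instead of prompt length, but the shape is the same: decide the tier first, then call the model.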

What benchmarks does this LLM leaderboard track?

We track six key benchmarks: GPQA Diamond (graduate-level reasoning), AIME 2025 (competition math), SWE-Bench Verified (real-world coding), Humanity's Last Exam (expert general knowledge), ARC-AGI 2 (visual reasoning), and MMMLU (multilingual understanding). We also compare speed (tokens/sec) and API pricing per 1M tokens.
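
To make the "per 1M tokens" pricing concrete, here is a minimal sketch of how the cost of a single request follows from the input and output prices listed in the table. The function name `request_cost` and the token counts are illustrative assumptions; the prices are the Claude Opus 4.7 ($5 / $25) and DeepSeek V4 Flash ($0.08 / $0.28) rows.

```python
def request_cost(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """Return the USD cost of one request given prices per 1M tokens."""
    total = input_tokens * input_price_per_m + output_tokens * output_price_per_m
    return total / 1_000_000


# Hypothetical workload: 20,000 input tokens and 2,000 output tokens.
opus = request_cost(20_000, 2_000, 5.00, 25.00)
flash = request_cost(20_000, 2_000, 0.08, 0.28)

print(f"Claude Opus 4.7:   ${opus:.5f} per request")
print(f"DeepSeek V4 Flash: ${flash:.5f} per request")
```

Because output tokens are priced several times higher than input tokens for most models in the table, workloads that generate long responses shift the comparison more than the input price alone suggests.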

How often is this AI leaderboard updated?

Our AI leaderboard is updated daily with the latest benchmark scores, speed metrics, and pricing from model providers and independent evaluations like BenchLM and Artificial Analysis. When a new model is released, we add it within 24-48 hours.

What is Sarvam AI and how does it compare?

Sarvam AI is India's leading open-weight AI model provider. Their Sarvam 105B model is purpose-built for Indian languages and multilingual tasks. While frontier models like Claude and GPT lead on English-centric benchmarks, Sarvam models are designed for the Indian market with strong Hindi, Tamil, Telugu, and other regional language support.

About the AI leaderboard

LLM Leaderboard is an independent AI benchmark comparison platform that ranks 40+ large language models across reasoning, coding, math, vision, and multilingual performance. Our AI leaderboard helps developers, researchers, and businesses choose the best AI model for their use case by providing clear, data-driven rankings and benchmark-led model selection guidance.

We compare models from every major AI provider — including OpenAI (GPT-5.5, GPT-5.4), Anthropic (Claude Opus 4.7, Claude Mythos), Google (Gemini 3.1 Pro), xAI (Grok 4.3), DeepSeek (R2, V4 Pro), Alibaba (Qwen 3.6), Meta (Llama 4), Mistral, Cohere, and Sarvam AI — all in one place.

Our leaderboard includes benchmark score history, live speed and cost metrics, and model comparison tools so you can evaluate GPT vs Claude vs Gemini vs Sarvam AI using the same objective data.