Updated Jun 6, 2026

llmleaderboard.in

LLM Leaderboard 2026 — Compare AI Models by Benchmarks, Speed & Price

COMPARE · BENCHMARK · RANK

Track and compare the latest benchmark performance of 40+ frontier AI models. Data is sourced from model providers, Artificial Analysis, BenchLM, and independently run evaluations.

LLM Leaderboard delivers clear, up-to-date rankings for reasoning, math, coding, vision, and multilingual performance while also showing speed and cost metrics — covering models from Anthropic, OpenAI, Google, Meta, xAI, DeepSeek, Alibaba, Mistral, Cohere, and more.


AI benchmark rankings by task

🧠 Reasoning · GPQA Diamond
  1. Claude Mythos Preview: 94.6%
  2. GPT-5.4 Pro: 94.5%
  3. Claude Opus 4.8: 94.4%
  4. Gemini 3.1 Pro: 94.3%
  5. Claude Opus 4.7: 94.2%
📐 Math · AIME 2025
💻 Agentic Coding · SWE-Bench
🌐 General · Humanity's Last Exam
👁 Visual Reasoning · ARC-AGI 2
🌏 Multilingual · MMMLU

Tracking the progression of state-of-the-art models on the GPQA Diamond benchmark from 2023 to 2026.

Speed & affordability

Fastest models (tokens/sec)
💰 Cheapest (per 1M tokens)

Compare AI models head-to-head


Full LLM leaderboard — all AI models

| Model | Provider | Country | Context | Cutoff | I/O Cost (USD per 1M tokens) | GPQA | SWE-Bench | Speed |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Claude Mythos Preview | Anthropic | 🇺🇸 USA | 1M | Apr 2026 | Limited | 94.6% | 93.9% | |
| GPT-5.4 Pro | OpenAI | 🇺🇸 USA | 1M | Mar 2026 | $30 / $180 | 94.5% | 80.2% | |
| Claude Opus 4.8 | Anthropic | 🇺🇸 USA | 1M | Jun 2026 | $6 / $30 | 94.4% | 93.7% | |
| Gemini 3.1 Pro | Google | 🇺🇸 USA | 1M | Feb 2026 | $2 / $12 | 94.3% | 80.6% | |
| Claude Opus 4.7 | Anthropic | 🇺🇸 USA | 200K | May 2025 | $5 / $25 | 94.2% | 82% | 67 t/s |
| GPT-5.5 Pro | OpenAI | 🇺🇸 USA | 1M | Apr 2026 | $30 / $180 | 94.2% | 81% | |
| MAI-Thinking-1 | Microsoft | 🇺🇸 USA | 256K | Jun 2026 | Private preview | 93.8% | 80.8% | |
| GPT-5.5 | OpenAI | 🇺🇸 USA | 1M | Apr 2026 | $5 / $30 | 93.6% | 78.6% | |
| GPT-5.4 | OpenAI | 🇺🇸 USA | 1M | Mar 2026 | $5 / $30 | 92.8% | 77.4% | |
| Gemini 3 Pro | Google | 🇺🇸 USA | 2M | Mar 2026 | $3.5 / $10.5 | 92.1% | 76.3% | |
| Claude Opus 4.6 | Anthropic | 🇺🇸 USA | 1M | May 2025 | $5 / $25 | 91.2% | 80.8% | 67 t/s |
| Kimi K2.6 | Moonshot | 🇨🇳 China | 256K | Apr 2026 | $0.75 / $3.50 | 91.1% | 80.2% | |
| DeepSeek R2 | DeepSeek | 🇨🇳 China | 128K | Feb 2026 | $0.55 / $2.19 | 89.3% | 72.4% | |
| Claude Sonnet 4.6 | Anthropic | 🇺🇸 USA | 1M | Aug 2025 | $3 / $15 | 88.5% | 74.2% | 55 t/s |
| Grok 4.3 | xAI | 🇺🇸 USA | 256K | Apr 2026 | $1.25 / $2.50 | 88% | 74.5% | 203 t/s |
| Qwen 3.6 Plus | Alibaba | 🇨🇳 China | 128K | Apr 2026 | $1.50 / $4.50 | 87.4% | 78.8% | |
| DeepSeek V4 Pro | DeepSeek | 🇨🇳 China | 1M | Mar 2026 | $0.30 / $0.50 | 87.1% | 81% | |
| Gemini 3.1 Flash-Lite | Google | 🇺🇸 USA | 1M | Jan 2025 | $0.25 / $1.50 | 86.9% | 62.8% | 363 t/s |
| Kimi K2 Thinking | Moonshot | 🇨🇳 China | 128K | Jan 2026 | $2 / $6 | 86.7% | 69.1% | |
| GPT-5.5 Instant | OpenAI | 🇺🇸 USA | 400K | Aug 2025 | $5 / $30 | 85.6% | | 145 t/s |
| Grok 4 | xAI | 🇺🇸 USA | 256K | Feb 2026 | $3 / $15 | 85.2% | 66.4% | |
| Claude Opus 4.5 | Anthropic | 🇺🇸 USA | 200K | Feb 2025 | $15 / $75 | 84.8% | 70.3% | 45 t/s |
| Qwen 3.5 397B | Alibaba | 🇨🇳 China | 128K | Feb 2026 | Open | 84.2% | 72.1% | |
| GLM-5 | Zhipu AI | 🇨🇳 China | 128K | Feb 2026 | $1.00 / $3.20 | 83.5% | 77.8% | |
| GPT-4.1 | OpenAI | 🇺🇸 USA | 1M | Jun 2025 | $2 / $8 | 82.4% | 68.9% | |
| Gemini 2.5 Flash | Google | 🇺🇸 USA | 1M | Jan 2026 | $0.15 / $0.6 | 80.3% | 61.4% | 780 t/s |
| Llama 4 Maverick | Meta | 🇺🇸 USA | 1M | Dec 2025 | Open | 80.1% | 62.3% | |
| DeepSeek V4 Flash | DeepSeek | 🇨🇳 China | 1M | Mar 2026 | $0.08 / $0.28 | 79.4% | 68.2% | |
| MiniMax M2 | MiniMax | 🇨🇳 China | 128K | Nov 2025 | $0.30 / $1.20 | 78.2% | 68.5% | |
| GPT-5.4 Mini | OpenAI | 🇺🇸 USA | 1M | Mar 2026 | $0.75 / $3 | 78.1% | 58.4% | |
| Qwen 3 235B | Alibaba | 🇨🇳 China | 128K | Apr 2025 | Open | 76.8% | 60.5% | |
| Llama 4 Scout | Meta | 🇺🇸 USA | 10M | Dec 2025 | Open | 76.5% | 55.8% | 2,600 t/s |
| Mistral Medium 3.5 | Mistral | 🇫🇷 France | 256K | Apr 2026 | $1.50 / $7.50 | 74.8% | 77.6% | |
| Mistral Large 3 | Mistral | 🇫🇷 France | 256K | Oct 2025 | $0.50 / $1.50 | 74.3% | 58.1% | |
| Gemini 2.0 Flash | Google | 🇺🇸 USA | 1M | Sep 2025 | $0.1 / $0.4 | 74.1% | 53.2% | 520 t/s |
| Grok 4.1 Fast | xAI | 🇺🇸 USA | 128K | Mar 2026 | $0.20 / $0.50 | 72.8% | 51.2% | 350 t/s |
| Cohere Command A | Cohere | 🇨🇦 Canada | 128K | Mar 2025 | $2.50 / $10 | 72.4% | 55.8% | |
| Claude Haiku 4.5 | Anthropic | 🇺🇸 USA | 200K | Aug 2025 | $1 / $5 | 72.1% | 58.3% | 120 t/s |
| DeepSeek R1 | DeepSeek | 🇨🇳 China | 128K | Dec 2024 | $0.55 / $2.19 | 71.5% | 49.2% | |
| GPT-4.1 mini | OpenAI | 🇺🇸 USA | 1M | Jun 2025 | $0.4 / $1.6 | 71.2% | 52.1% | |
| Mistral Small 4 | Mistral | 🇫🇷 France | 256K | Feb 2026 | $0.15 / $0.60 | 68.9% | 49.2% | 137 t/s |
| GPT-5.4 Nano | OpenAI | 🇺🇸 USA | 512K | Mar 2026 | $0.20 / $1.25 | 68.5% | 42.1% | |
| Sarvam 105B | Sarvam AI | 🇮🇳 India | 128K | Jun 2025 | Open | 66.5% | | |
| Devstral Medium | Mistral | 🇫🇷 France | 128K | Jun 2025 | $0.40 / $2.00 | 66.1% | 62.4% | |
| Gemma 3 27B | Google | 🇺🇸 USA | 131K | Mar 2025 | $0.07 / $0.07 | 42.6% | 11.4% | |
| Nova Micro | Amazon | 🇺🇸 USA | 128K | Dec 2024 | $0.04 / $0.14 | | | |
| Sarvam 30B | Sarvam AI | 🇮🇳 India | 32K | Jun 2025 | Open | | | |
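
To compare rows on price and speed at once, it can help to collapse the input and output prices into one number. The sketch below is a minimal illustration, not how this leaderboard computes its rankings: the `Row` type, the `blended_cost`, `cheapest`, and `fastest` helpers, and the 3:1 input-to-output token mix are all assumptions made for the example. The six rows are copied from the table above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Row:
    model: str
    input_usd: float                   # USD per 1M input tokens
    output_usd: float                  # USD per 1M output tokens
    gpqa: float                        # GPQA Diamond score, percent
    speed_tps: Optional[float] = None  # tokens/sec; None where the table is blank


# A handful of rows copied from the table (not the full leaderboard).
ROWS = [
    Row("GPT-5.4 Pro", 30.0, 180.0, 94.5),
    Row("Gemini 3.1 Pro", 2.0, 12.0, 94.3),
    Row("Claude Opus 4.7", 5.0, 25.0, 94.2, 67),
    Row("Grok 4.3", 1.25, 2.50, 88.0, 203),
    Row("DeepSeek V4 Flash", 0.08, 0.28, 79.4),
    Row("Gemini 2.0 Flash", 0.10, 0.40, 74.1, 520),
]


def blended_cost(row: Row, input_share: float = 0.75) -> float:
    """Weighted USD per 1M tokens.

    input_share is an assumed traffic mix (0.75 means 3 input tokens
    for every output token); adjust it to match your own workload.
    """
    return input_share * row.input_usd + (1 - input_share) * row.output_usd


def cheapest(rows: List[Row], n: int = 3) -> List[Row]:
    """Return the n rows with the lowest blended cost."""
    return sorted(rows, key=blended_cost)[:n]


def fastest(rows: List[Row], n: int = 3) -> List[Row]:
    """Return the n fastest rows, skipping rows with no speed value."""
    timed = [r for r in rows if r.speed_tps is not None]
    return sorted(timed, key=lambda r: r.speed_tps, reverse=True)[:n]


if __name__ == "__main__":
    for r in cheapest(ROWS):
        print(f"{r.model}: ${blended_cost(r):.4f} per 1M blended tokens")
    for r in fastest(ROWS):
        print(f"{r.model}: {r.speed_tps} t/s")
```

With these six rows and the assumed 3:1 mix, the three cheapest are DeepSeek V4 Flash, Gemini 2.0 Flash, and Grok 4.3. A workload that generates much more output than it reads would need a lower `input_share`, which can change the order.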

Model selection guides

Focused rankings for common searches — each guide uses the same data as this leaderboard.

Best LLM for coding · SWE-Bench rankings
Cheapest AI models · API cost per 1M tokens
GPT vs Claude vs Gemini · Frontier comparison
Best open-source LLM · Open-weights models
Best Indian LLMs in 2026 · India-focused and multilingual models
Best Multilingual LLM in 2026 · Top models for cross-language and regional language support
Largest context window · Long-document LLMs

Frequently asked questions

What is an AI leaderboard?

An AI leaderboard is a ranking system that compares large language models (LLMs) across standardized benchmarks. LLM Leaderboard ranks 40+ models from providers like OpenAI, Anthropic, Google, xAI, DeepSeek, Alibaba, and Sarvam AI using benchmarks such as GPQA Diamond, SWE-Bench, AIME 2025, and Humanity's Last Exam.

Which is the best AI model in 2026?

There is no single best AI model; it depends on your use case. For coding, Claude Mythos Preview (93.9%) and Claude Opus 4.8 (93.7%) lead SWE-Bench. For reasoning, Claude Mythos Preview (94.6%) and GPT-5.4 Pro (94.5%) top GPQA Diamond. For cost efficiency, DeepSeek V4 Flash ($0.08 / $0.28 per 1M tokens) and Gemini 2.0 Flash ($0.1 / $0.4 per 1M tokens) offer the best value. For Indian language support, Sarvam AI models are purpose-built for multilingual performance.

GPT-5.5 vs Claude Opus 4.7 vs Gemini 3.1 Pro — which should I use?

Choose Claude Opus 4.7 for complex coding and instruction-following. Choose GPT-5.5 for general-purpose reasoning, agentic workflows, and ecosystem integration. Choose Gemini 3.1 Pro for a 1M-token context window, multimodal tasks, and cost-effective research ($2 / $12 per 1M tokens, against $5 / $25 for Claude Opus 4.7 and $5 / $30 for GPT-5.5). Many teams use a routing strategy: cheaper models for simple tasks, frontier models for complex ones.
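
The routing strategy mentioned above can be sketched in a few lines. This is a toy illustration under stated assumptions, not a production router: `estimate_complexity`, its keyword list, the thresholds, and the choice of which model sits in each tier are all hypothetical, and the model names are plain strings taken from the leaderboard rather than real API identifiers.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Tier:
    model: str           # display name from the leaderboard, not an API model ID
    max_complexity: int  # highest complexity score this tier should handle


# Tiers ordered cheapest first. Thresholds are illustrative assumptions.
TIERS: List[Tier] = [
    Tier("DeepSeek V4 Flash", 2),
    Tier("Gemini 3.1 Pro", 5),
]
FALLBACK_MODEL = "Claude Opus 4.7"  # assumed frontier tier for everything else

# Hypothetical keywords that hint at a harder coding or reasoning task.
HARD_KEYWORDS = ("refactor", "prove", "debug", "architecture")


def estimate_complexity(prompt: str) -> int:
    """Crude heuristic: long prompts and 'hard' keywords raise the score."""
    text = prompt.lower()
    length_score = len(text.split()) // 50
    keyword_score = 3 * sum(1 for k in HARD_KEYWORDS if k in text)
    return length_score + keyword_score


def pick_model(prompt: str) -> str:
    """Return the cheapest tier whose threshold covers the prompt."""
    score = estimate_complexity(prompt)
    for tier in TIERS:
        if score <= tier.max_complexity:
            return tier.model
    return FALLBACK_MODEL


if __name__ == "__main__":
    for p in ("Translate 'hello' to Hindi",
              "Debug this failing test",
              "Refactor the service architecture and debug the race"):
        print(pick_model(p), "<-", p)
```

In practice the heuristic is the part that matters most. Teams often replace keyword matching with a small classifier or with a retry on the larger model when the cheaper one fails a check, but the tiered lookup stays the same shape.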

What benchmarks does this LLM leaderboard track?

We track six key benchmarks: GPQA Diamond (graduate-level reasoning), AIME 2025 (competition math), SWE-Bench Verified (real-world coding), Humanity's Last Exam (expert general knowledge), ARC-AGI 2 (visual reasoning), and MMMLU (multilingual understanding). We also compare speed (tokens/sec) and API pricing per 1M tokens.

How often is this AI leaderboard updated?

Our AI leaderboard is updated daily with the latest benchmark scores, speed metrics, and pricing from model providers and independent evaluations like BenchLM and Artificial Analysis. When a new model is released, we add it within 24-48 hours.

What is Sarvam AI and how does it compare?

Sarvam AI is India's leading open-weight AI model provider. Its Sarvam 105B model is purpose-built for Indian languages and multilingual tasks. While frontier models like Claude and GPT lead on English-centric benchmarks, Sarvam models are designed for the Indian market with strong Hindi, Tamil, Telugu, and other regional language support.

About the AI leaderboard

LLM Leaderboard is an independent AI benchmark comparison platform that ranks 40+ large language models across reasoning, coding, math, vision, and multilingual performance. Our AI leaderboard helps developers, researchers, and businesses choose the best AI model for their use case by providing clear, data-driven rankings and benchmark-led model selection guidance.

We compare models from every major AI provider — including OpenAI (GPT-5.5, GPT-5.4), Anthropic (Claude Opus 4.7, Claude Mythos), Google (Gemini 3.1 Pro), xAI (Grok 4.3), DeepSeek (R2, V4 Pro), Alibaba (Qwen 3.6), Meta (Llama 4), Mistral, Cohere, and Sarvam AI — all in one place.

Our leaderboard includes benchmark score history, live speed and cost metrics, and model comparison tools so you can evaluate GPT vs Claude vs Gemini vs Sarvam AI using the same objective data. This makes us a better search result for AI model benchmark comparison and LLM ranking queries.