llmleaderboard.in
LLM Leaderboard 2026 — Compare AI Models by Benchmarks, Speed & Price
Track and compare the latest benchmark performance of 40+ frontier AI models. Data sourced from model providers, Artificial Analysis, BenchLM, and independently run evaluations.
LLM Leaderboard delivers clear, up-to-date rankings for reasoning, math, coding, vision, and multilingual performance while also showing speed and cost metrics — covering models from Anthropic, OpenAI, Google, Meta, xAI, DeepSeek, Alibaba, Mistral, Cohere, and more.
AI benchmark rankings by task
- Claude Mythos Preview — 94.6%
- GPT-5.4 Pro — 94.5%
- Gemini 3.1 Pro — 94.3%
- Claude Opus 4.7 — 94.2%
- GPT-5.5 — 93.6%
Historical progress
Tracking the progression of state-of-the-art models on the GPQA Diamond benchmark from 2023 to 2026.
Full LLM leaderboard — all AI models
| Model | Provider | Country | Context | Cutoff | I/O Cost (USD per 1M tokens, input / output) | GPQA Diamond | SWE-Bench | Speed |
|---|---|---|---|---|---|---|---|---|
| Claude Mythos Preview | Anthropic | 🇺🇸 USA | 1M | Apr 2026 | Limited | 94.6% | 93.9% | — |
| GPT-5.4 Pro | OpenAI | 🇺🇸 USA | 1M | Mar 2026 | $30 / $180 | 94.5% | 80.2% | — |
| Claude Opus 4.8 | Anthropic | 🇺🇸 USA | 1M | Jun 2026 | $6 / $30 | 94.4% | 93.7% | — |
| Gemini 3.1 Pro | Google | 🇺🇸 USA | 1M | Feb 2026 | $2 / $12 | 94.3% | 80.6% | — |
| Claude Opus 4.7 | Anthropic | 🇺🇸 USA | 200K | May 2025 | $5 / $25 | 94.2% | 82% | 67 t/s |
| GPT-5.5 Pro | OpenAI | 🇺🇸 USA | 1M | Apr 2026 | $30 / $180 | 94.2% | 81% | — |
| MAI-Thinking-1 | Microsoft | 🇺🇸 USA | 256K | Jun 2026 | Private preview | 93.8% | 80.8% | — |
| GPT-5.5 | OpenAI | 🇺🇸 USA | 1M | Apr 2026 | $5 / $30 | 93.6% | 78.6% | — |
| GPT-5.4 | OpenAI | 🇺🇸 USA | 1M | Mar 2026 | $5 / $30 | 92.8% | 77.4% | — |
| Gemini 3 Pro | Google | 🇺🇸 USA | 2M | Mar 2026 | $3.5 / $10.5 | 92.1% | 76.3% | — |
| Claude Opus 4.6 | Anthropic | 🇺🇸 USA | 1M | May 2025 | $5 / $25 | 91.2% | 80.8% | 67 t/s |
| Kimi K2.6 | Moonshot | 🇨🇳 China | 256K | Apr 2026 | $0.75 / $3.50 | 91.1% | 80.2% | — |
| DeepSeek R2 | DeepSeek | 🇨🇳 China | 128K | Feb 2026 | $0.55 / $2.19 | 89.3% | 72.4% | — |
| Claude Sonnet 4.6 | Anthropic | 🇺🇸 USA | 1M | Aug 2025 | $3 / $15 | 88.5% | 74.2% | 55 t/s |
| Grok 4.3 | xAI | 🇺🇸 USA | 256K | Apr 2026 | $1.25 / $2.50 | 88% | 74.5% | 203 t/s |
| Qwen 3.6 Plus | Alibaba | 🇨🇳 China | 128K | Apr 2026 | $1.50 / $4.50 | 87.4% | 78.8% | — |
| DeepSeek V4 Pro | DeepSeek | 🇨🇳 China | 1M | Mar 2026 | $0.30 / $0.50 | 87.1% | 81% | — |
| Gemini 3.1 Flash-Lite | Google | 🇺🇸 USA | 1M | Jan 2025 | $0.25 / $1.50 | 86.9% | 62.8% | 363 t/s |
| Kimi K2 Thinking | Moonshot | 🇨🇳 China | 128K | Jan 2026 | $2 / $6 | 86.7% | 69.1% | — |
| GPT-5.5 Instant | OpenAI | 🇺🇸 USA | 400K | Aug 2025 | $5 / $30 | 85.6% | — | 145 t/s |
| Grok 4 | xAI | 🇺🇸 USA | 256K | Feb 2026 | $3 / $15 | 85.2% | 66.4% | — |
| Claude Opus 4.5 | Anthropic | 🇺🇸 USA | 200K | Feb 2025 | $15 / $75 | 84.8% | 70.3% | 45 t/s |
| Qwen 3.5 397B | Alibaba | 🇨🇳 China | 128K | Feb 2026 | Open | 84.2% | 72.1% | — |
| GLM-5 | Zhipu AI | 🇨🇳 China | 128K | Feb 2026 | $1.00 / $3.20 | 83.5% | 77.8% | — |
| GPT-4.1 | OpenAI | 🇺🇸 USA | 1M | Jun 2025 | $2 / $8 | 82.4% | 68.9% | — |
| Gemini 2.5 Flash | Google | 🇺🇸 USA | 1M | Jan 2026 | $0.15 / $0.6 | 80.3% | 61.4% | 780 t/s |
| Llama 4 Maverick | Meta | 🇺🇸 USA | 1M | Dec 2025 | Open | 80.1% | 62.3% | — |
| DeepSeek V4 Flash | DeepSeek | 🇨🇳 China | 1M | Mar 2026 | $0.08 / $0.28 | 79.4% | 68.2% | — |
| MiniMax M2 | MiniMax | 🇨🇳 China | 128K | Nov 2025 | $0.30 / $1.20 | 78.2% | 68.5% | — |
| GPT-5.4 Mini | OpenAI | 🇺🇸 USA | 1M | Mar 2026 | $0.75 / $3 | 78.1% | 58.4% | — |
| Qwen 3 235B | Alibaba | 🇨🇳 China | 128K | Apr 2025 | Open | 76.8% | 60.5% | — |
| Llama 4 Scout | Meta | 🇺🇸 USA | 10M | Dec 2025 | Open | 76.5% | 55.8% | 2,600 t/s |
| Mistral Medium 3.5 | Mistral | 🇫🇷 France | 256K | Apr 2026 | $1.50 / $7.50 | 74.8% | 77.6% | — |
| Mistral Large 3 | Mistral | 🇫🇷 France | 256K | Oct 2025 | $0.50 / $1.50 | 74.3% | 58.1% | — |
| Gemini 2.0 Flash | Google | 🇺🇸 USA | 1M | Sep 2025 | $0.1 / $0.4 | 74.1% | 53.2% | 520 t/s |
| Grok 4.1 Fast | xAI | 🇺🇸 USA | 128K | Mar 2026 | $0.20 / $0.50 | 72.8% | 51.2% | 350 t/s |
| Cohere Command A | Cohere | 🇨🇦 Canada | 128K | Mar 2025 | $2.50 / $10 | 72.4% | 55.8% | — |
| Claude Haiku 4.5 | Anthropic | 🇺🇸 USA | 200K | Aug 2025 | $1 / $5 | 72.1% | 58.3% | 120 t/s |
| DeepSeek R1 | DeepSeek | 🇨🇳 China | 128K | Dec 2024 | $0.55 / $2.19 | 71.5% | 49.2% | — |
| GPT-4.1 mini | OpenAI | 🇺🇸 USA | 1M | Jun 2025 | $0.4 / $1.6 | 71.2% | 52.1% | — |
| Mistral Small 4 | Mistral | 🇫🇷 France | 256K | Feb 2026 | $0.15 / $0.60 | 68.9% | 49.2% | 137 t/s |
| GPT-5.4 Nano | OpenAI | 🇺🇸 USA | 512K | Mar 2026 | $0.20 / $1.25 | 68.5% | 42.1% | — |
| Sarvam 105B | Sarvam AI | 🇮🇳 India | 128K | Jun 2025 | Open | 66.5% | — | — |
| Devstral Medium | Mistral | 🇫🇷 France | 128K | Jun 2025 | $0.40 / $2.00 | 66.1% | 62.4% | — |
| Gemma 3 27B | Google | 🇺🇸 USA | 131K | Mar 2025 | $0.07 / $0.07 | 42.6% | 11.4% | — |
| Nova Micro | Amazon | 🇺🇸 USA | 128K | Dec 2024 | $0.04 / $0.14 | — | — | — |
| Sarvam 30B | Sarvam AI | 🇮🇳 India | 32K | Jun 2025 | Open | — | — | — |
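The leaderboard rows can also be filtered in code, for example to find the strongest coding model under a price ceiling. The following is a minimal sketch, assuming a handful of rows are copied by hand from the table; `Row`, `parse_io_cost`, and `best_under` are hypothetical names used for illustration and are not part of any API offered by this site.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Row:
    model: str
    io_cost: str      # raw "I/O Cost" cell, e.g. "$5 / $30", "Open", "Limited"
    gpqa: float       # GPQA score in percent
    swe_bench: float  # SWE-Bench score in percent


def parse_io_cost(cell: str) -> Optional[Tuple[float, float]]:
    """Return (input, output) USD per 1M tokens, or None for non-price cells."""
    parts = [p.strip() for p in cell.split("/")]
    if len(parts) != 2 or not all(p.startswith("$") for p in parts):
        return None
    return float(parts[0][1:]), float(parts[1][1:])


# A few rows copied from the table above (illustrative subset only).
ROWS = [
    Row("GPT-5.4 Pro", "$30 / $180", 94.5, 80.2),
    Row("Claude Opus 4.8", "$6 / $30", 94.4, 93.7),
    Row("Kimi K2.6", "$0.75 / $3.50", 91.1, 80.2),
    Row("DeepSeek V4 Pro", "$0.30 / $0.50", 87.1, 81.0),
    Row("Claude Mythos Preview", "Limited", 94.6, 93.9),
]


def best_under(rows: List[Row], max_output_price: float) -> List[Row]:
    """Rows with a listed output price at or below the ceiling, best SWE-Bench first."""
    eligible = []
    for row in rows:
        cost = parse_io_cost(row.io_cost)
        if cost is not None and cost[1] <= max_output_price:
            eligible.append(row)
    return sorted(eligible, key=lambda r: r.swe_bench, reverse=True)


if __name__ == "__main__":
    for row in best_under(ROWS, max_output_price=5.0):
        print(row.model, row.swe_bench)
```

Rows whose cost cell is not a price pair ("Open", "Limited", "Private preview") are skipped rather than treated as free, since the table does not state a per-token price for them.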
Model selection guides
Focused rankings for common searches — each guide uses the same data as this leaderboard.
Frequently asked questions
What is an AI leaderboard?
An AI leaderboard is a ranking system that compares large language models (LLMs) across standardized benchmarks. LLM Leaderboard ranks 40+ models from providers like OpenAI, Anthropic, Google, xAI, DeepSeek, Alibaba, and Sarvam AI using benchmarks such as GPQA Diamond, SWE-Bench, AIME 2025, and Humanity's Last Exam.
Which is the best AI model in 2026?
There is no single best AI model; it depends on your use case. For coding, Claude Mythos Preview (93.9%) and Claude Opus 4.8 (93.7%) lead SWE-Bench in the table above. For reasoning, Claude Mythos Preview (94.6%) and GPT-5.4 Pro (94.5%) top GPQA Diamond, though Mythos Preview has limited availability. For cost efficiency, DeepSeek V4 Flash ($0.08 / $0.28 per 1M tokens) and Gemini 2.0 Flash ($0.1 / $0.4 per 1M tokens) offer the best value. For Indian language support, Sarvam AI models are purpose-built for multilingual performance.
GPT-5.5 vs Claude Opus 4.7 vs Gemini 3.1 Pro — which should I use?
Choose Claude Opus 4.7 for complex coding and instruction-following. Choose GPT-5.5 for general-purpose reasoning, agentic workflows, and ecosystem integration. Choose Gemini 3.1 Pro for massive context windows (1M+ tokens), multimodal tasks, and cost-effective research. Many teams use a routing strategy — cheaper models for simple tasks, frontier models for complex ones.
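The routing strategy mentioned above could be sketched roughly as follows. This is an illustration only: the keyword list, the 2,000-character threshold, and the `pick_model` function are assumptions made for the example, and the model names are plain labels taken from the leaderboard, not API identifiers.

```python
# Labels taken from the leaderboard; real API model IDs will differ.
CHEAP_MODEL = "DeepSeek V4 Flash"
CODING_MODEL = "Claude Opus 4.7"
GENERAL_MODEL = "GPT-5.5"

# Hypothetical hints that a prompt is a coding task.
CODE_HINTS = ("def ", "class ", "traceback", "stack trace", "refactor", "compile")


def pick_model(prompt: str, long_prompt_chars: int = 2000) -> str:
    """Pick a model label with a crude heuristic: coding, then length, else cheap."""
    text = prompt.lower()
    if any(hint in text for hint in CODE_HINTS):
        return CODING_MODEL
    if len(prompt) > long_prompt_chars:
        return GENERAL_MODEL
    return CHEAP_MODEL


if __name__ == "__main__":
    print(pick_model("What is the capital of France?"))
    print(pick_model("Please refactor this function to remove duplication."))
```

In practice teams often replace the keyword check with a small classifier or with fallback on failure, but the shape stays the same: a cheap default, with escalation only when the task appears to need it.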
What benchmarks does this LLM leaderboard track?
We track six key benchmarks: GPQA Diamond (graduate-level reasoning), AIME 2025 (competition math), SWE-Bench Verified (real-world coding), Humanity's Last Exam (expert general knowledge), ARC-AGI 2 (visual reasoning), and MMMLU (multilingual understanding). We also compare speed (tokens/sec) and API pricing per 1M tokens.
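To make the pricing and speed units concrete, here is a small worked example, assuming prices are quoted in USD per 1M tokens (input / output) and speed in output tokens per second. The function names are hypothetical, and the figures in the usage section are the Claude Sonnet 4.6 row from the table ($3 / $15, 55 t/s).

```python
def request_cost_usd(input_tokens: int, output_tokens: int,
                     input_price: float, output_price: float) -> float:
    """Cost of one request, with prices given in USD per 1M tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


def generation_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Approximate time to stream the output, ignoring time to first token."""
    return output_tokens / tokens_per_second


if __name__ == "__main__":
    # 10,000 input tokens and 2,000 output tokens at $3 / $15 per 1M tokens.
    cost = request_cost_usd(10_000, 2_000, input_price=3.0, output_price=15.0)
    seconds = generation_seconds(2_000, tokens_per_second=55.0)
    print(f"cost: ${cost:.2f}, generation time: {seconds:.1f} s")
```

Here the input side costs 10,000 × $3 / 1,000,000 = $0.03 and the output side 2,000 × $15 / 1,000,000 = $0.03, for about $0.06 per request. Output tokens are usually priced several times higher than input tokens, so long responses dominate the bill.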
How often is this AI leaderboard updated?
Our AI leaderboard is updated daily with the latest benchmark scores, speed metrics, and pricing from model providers and independent evaluations like BenchLM and Artificial Analysis. When a new model is released, we add it within 24-48 hours.
What is Sarvam AI and how does it compare?
Sarvam AI is India's leading open-weight AI model provider. Their Sarvam 105B model is purpose-built for Indian languages and multilingual tasks. While frontier models like Claude and GPT lead on English-centric benchmarks, Sarvam models are designed for the Indian market with strong Hindi, Tamil, Telugu, and other regional language support.
About the AI leaderboard
LLM Leaderboard is an independent AI benchmark comparison platform that ranks 40+ large language models across reasoning, coding, math, vision, and multilingual performance. Our AI leaderboard helps developers, researchers, and businesses choose the best AI model for their use case by providing clear, data-driven rankings and benchmark-led model selection guidance.
We compare models from every major AI provider — including OpenAI (GPT-5.5, GPT-5.4), Anthropic (Claude Opus 4.7, Claude Mythos), Google (Gemini 3.1 Pro), xAI (Grok 4.3), DeepSeek (R2, V4 Pro), Alibaba (Qwen 3.6), Meta (Llama 4), Mistral, Cohere, and Sarvam AI — all in one place.
Our leaderboard includes benchmark score history, live speed and cost metrics, and model comparison tools so you can evaluate GPT vs Claude vs Gemini vs Sarvam AI using the same objective data.