How often is the tracker updated?

The tracker is updated when new frontier models launch or existing models receive significant updates to pricing, benchmarks, or capabilities. Each update is logged here with details.

What kind of changes are tracked?

New model additions, benchmark score corrections, pricing updates, new features (selectable columns, cost calculator, API), and structural changes.

Changelog

Updates to the AI Frontier Model Tracker - new models, data corrections, benchmark additions, and feature changes.

July 25, 2026

Claude Opus 5 (Anthropic)

Added Claude Opus 5 (Anthropic, July 24) - a new flagship positioned as a daily-driver Opus that rivals Claude Fable 5 at half the price ($5/$25 vs $10/$50). Anthropic reports that it more than doubles Opus 4.8's score on Frontier-Bench v0.1, comes within 0.5% of Fable 5 on CursorBench 3.2 at half the cost, and scores roughly 3x the next-best model on ARC-AGI 3. Standard SWE-bench and GPQA numbers were not published at launch, so those fields are left blank. 1M context, multimodal, adaptive thinking.
91 models across 19 providers

July 20, 2026

Qwen3.8-Max-Preview (Alibaba)

Added Qwen3.8-Max-Preview (Alibaba, July 19) - the Qwen3.8 flagship and, at 2.4T parameters, Qwen's first multimodal model above 1T parameters; callable now via Alibaba's Token Plan / Qoder. Open weights are promised "soon" (no date and no Hugging Face card yet); active-parameter count, context window, and per-token pricing are undisclosed. Qwen positions it as "second only to Fable 5" on internal evaluations. No independent benchmarks have been published yet, so those fields are left blank.
Watch: Gemini 3.5 Pro still unreleased - no model card or official pricing.
90 models across 19 providers

July 18, 2026

Kimi K3 (Moonshot AI)

Added Kimi K3 (Moonshot AI, July 16) - a 2.8T-parameter open-weight multimodal reasoning MoE (896 experts, 16 active per token) built on Kimi Delta Attention, with 1M context. Independent testing ranks it #4 among all frontier models, behind only Claude Fable 5 and GPT-5.6 Sol and ahead of Opus 4.8; debuted #1 on LMArena's Frontend Code Arena (1679 Elo). $3/$15 per 1M tokens; full open weights due July 27.
Watch: Gemini 3.5 Pro still unreleased - missed its July 17 target (a third slip), with no model card or official pricing yet.
89 models across 19 providers

July 16, 2026

Inkling (Thinking Machines Lab)

Added Inkling (Thinking Machines Lab, July 15) - the lab's debut model and a new provider (Mira Murati's team). Sparse MoE with 975B total / 41B active parameters, pretrained on 45T tokens of text, image, audio, and video. 1M context, open weights on Hugging Face, fine-tunable via Tinker. The lab is candid that it is "not the strongest overall model available today," emphasizing multimodal breadth and efficient reasoning. Self-reported at effort=0.99 (not independently verified): GPQA Diamond 87.2, AIME 2026 97.1, SWE-bench Verified 77.6. No first-party per-token pricing disclosed yet.
Watch: Gemini 3.5 Pro still in preview, targeting July 17.
88 models across 19 providers

July 10, 2026

Grok 4.5, GPT-5.6 family (Sol, Terra, Luna)

Added Grok 4.5 (xAI, July 8) - a new flagship positioned as Opus-class, built on the 1.5T V9 foundation and trained on real Cursor session data for coding and agentic work. Artificial Analysis Intelligence Index 54 at launch (#4 overall, up 16 points from Grok 4.3); xAI claims ~4.2x fewer output tokens than Opus 4.8 on SWE-Bench Pro. 500K context, $2/$6 per 1M.
Added the GPT-5.6 family (OpenAI, public July 9 after a gated preview from June 25) as three tiers, matching the GPT-5.5 pattern: Sol (flagship, SOTA on Terminal-Bench 2.1, $5/$30), Terra (mid, GPT-5.5-class at ~half the cost, $2.50/$15), and Luna (fastest and cheapest, $1/$6). All 1M context, with programmatic tool calling in the Responses API.
Watch: Gemini 3.5 Pro still in preview, targeting July 17.
87 models across 18 providers

July 8, 2026

Claude Sonnet 5, LongCat-2.0; Fable 5 & Mythos 5 back online

Added Claude Sonnet 5 (June 30) - Anthropic's most agentic Sonnet, positioned close to Opus 4.8 at a lower price; it replaced Sonnet 4.6 as the default on Claude.ai (Free/Pro). Humanity's Last Exam 34.6% (46.8% with tools), OSWorld-Verified 78.5%. 1M context, newer tokenizer (~30% more tokens for the same text). Intro pricing $2/$10 per 1M through Aug 31, then $3/$15 (batch 50% off). SWE-bench Verified and GPQA numbers vary across sources and are pending system-card confirmation.
Claude Fable 5 and Mythos 5 redeployed (July 1) - the US lifted the June 12 export-control order, so both are available again across all platforms. Updated their entries.
Added LongCat-2.0 (Meituan, June 29) - a 1.6T open MoE agentic coding model (~48B active, 1M context) trained on a ~50,000-card domestic Chinese ASIC cluster. It ran anonymously as "Owl Alpha" and topped OpenRouter's coding usage charts for ~2 months before Meituan revealed it; added on the strength of that demonstrated adoption. Vendor-reported benchmarks, independent verification pending: 59.5 SWE-Bench Pro, 70.8 Terminal-Bench 2.1, 77.3 SWE-Bench Multilingual. MIT license, full weights coming (INT8 posted). Meituan is the 18th provider.
Watch (not added): GPT-5.6 (Sol/Terra/Luna) previewed June 26 but gated to ~20 US-gov-approved orgs; Gemini 3.5 Pro delayed to July 17 (architecture rebuild); Grok 4.5 in private beta.
83 models across 18 providers

June 27, 2026

Doubao Seed 2.1 Pro & Turbo (ByteDance)

Added Doubao Seed 2.1 Pro (June 24) - ByteDance's first entry in the tracker. Flagship deep-thinking model for the coding/agent era, which ByteDance positions as comparable to GPT-5.5. Leads GDPval, MobileWorld, CharXiv-RQ, MeasureBench; #8 on Code Arena Frontend (1539, level with Claude Opus 4.6); top-tier on Agents' Last Exam. Multimodal (image + video). Closed, API via Doubao / Volcano Engine. ¥6/¥30 per 1M (~$0.83/$4.17). Parameters and context window not disclosed.
Added Doubao Seed 2.1 Turbo (June 24) - low-cost, low-latency sibling for high-volume enterprise deployment, reported comparable to Pro. ¥3/¥15 per 1M (~$0.42/$2.08).
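The dollar figures in the two Doubao entries above are consistent with an exchange rate of about 7.2 CNY per USD. That rate is an assumption inferred from the listed numbers, not one ByteDance publishes, and real conversions will drift with the market. A minimal sketch of the conversion:

```python
# Assumed exchange rate, inferred from the listed USD approximations; not an official rate.
CNY_PER_USD = 7.2


def cny_to_usd(price_cny: float, rate: float = CNY_PER_USD) -> float:
    """Convert a per-1M-token price in yuan to US dollars, rounded to cents."""
    return round(price_cny / rate, 2)


# Doubao Seed 2.1 list prices per 1M tokens in CNY: (input, output).
doubao_prices_cny = {
    "Doubao Seed 2.1 Pro": (6, 30),
    "Doubao Seed 2.1 Turbo": (3, 15),
}

for model, (inp, out) in doubao_prices_cny.items():
    print(f"{model}: ${cny_to_usd(inp):.2f}/${cny_to_usd(out):.2f}")
```

At the assumed rate this reproduces the approximations given in the entries: $0.83/$4.17 for Pro and $0.42/$2.08 for Turbo.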
Watch: Gemini 3.5 Pro (2M context, Deep Think) still in limited Vertex preview - no GA, no published specs/pricing/benchmarks, so not added.
81 models across 17 providers

June 21, 2026

Kimi K2.7 Code, GLM-5.2

Added Kimi K2.7 Code (June 12) - Moonshot AI's coding-specialized 1T MoE / 32B active, built on K2.6. Mandatory thinking mode, ~30% fewer reasoning tokens than K2.6; MoonViT vision encoder for image and video input; native INT4. Moonshot reports +21.8% on Kimi Code Bench v2 over K2.6 - no independent SWE-Bench numbers published yet. 256K context. Open-weight, Modified MIT. $0.95/$4.00.
Added GLM-5.2 (June 13) - Zhipu AI's open-weight 753B MoE / 40B active. SWE-Bench Pro 62.1% (up from GLM-5.1's 58.4%), beating GPT-5.5 (58.6%). Terminal-Bench 2.1 81.0%, AIME 2026 99.2%, GPQA Diamond 91.2%, HLE 54.7% with tools. Selectable reasoning effort (High/Max). 1M context, up to 131K output. MIT license. $1.40/$4.40; GLM Coding Plan from ~$10/mo.
Noted the June 12 export-control suspension of Claude Fable 5 and Mythos 5 - Anthropic pulled both from all platforms (Anthropic API, AWS Bedrock, Google Cloud, Microsoft Foundry, Snowflake, Box) after a US directive that it could not enforce by nationality in real time.
79 models across 16 providers

June 11, 2026

Claude Mythos 5, Claude Fable 5

Added Claude Mythos 5 (June 9) - Anthropic's restricted Mythos-class frontier. Per System Card: 95.5% SWE-Bench Verified, 80.3% SWE-Bench Pro, 88.0% Terminal-Bench 2.1, 85.0% OSWorld-Verified, 94.1% GPQA Diamond (Anthropic now considers GPQA saturated), 59.0% HLE (64.5% w/tools), 88.0/93.3% BrowseComp single/multi-agent. 1M context. $10/$50. Restricted to Project Glasswing partners (US gov cyberdefenders) and select biomedical research orgs.
Added Claude Fable 5 (June 9) - the same underlying model as Mythos 5, with safety classifiers that fall back to Opus 4.8 for cyber/bio/chem/distillation queries. Per System Card: 95% SWE-Verified, 80% SWE-Pro, 84.3% Terminal-Bench 2.1, 85.0% OSWorld, 29.3 FrontierCode Diamond, 1932 GDPval-AA, 57.9% OfficeQA Pro. Anthropic cites Stripe's 50M-line Ruby migration in one day. 1M context. $10/$50 (batch $5/$25). Pro/Max/Team plans include Fable 5 free through June 22. Available via Anthropic API, AWS Bedrock, GitHub Copilot.
Updated the MiniMax M3 description - the ~June 11 open-weights commitment has slipped; huggingface.co/MiniMaxAI still showed only M2-series repos as of June 11.
77 models across 16 providers

June 4, 2026

MAI-Code-1-Flash, MAI-Thinking-1, MiniMax M3

Added MAI-Code-1-Flash (June 2) - Microsoft's first in-house coding model. 5B active sparse MoE (total parameter count not published on Microsoft's pages). 51.2% SWE-Bench Pro (+16 points over Claude Haiku 4.5). +28.9 IF Bench, 85.8% adversarial accuracy. Solves harder problems with up to 60% fewer tokens on SWE-Verified. Rolling out to GitHub Copilot in VS Code.
Added MAI-Thinking-1 (June 2) - Microsoft's first in-house reasoning model, not a distillation. 35B active sparse MoE (total not published). Per the Build 2026 keynote: 52.8% SWE-Bench Pro (alongside Opus 4.6), 97.0% AIME 2025, 94.5% AIME 2026. 256K context. Private preview on Microsoft Foundry.
Added MiniMax M3 (June 1) - a model with committed open weights, combining frontier coding, 1M context, and native multimodality (image + video). Weights and technical report pledged for ~June 11. Built on MiniMax Sparse Attention (MSA) - 9x faster prefill and 15x faster decoding at 1M context. 59.0% SWE-Bench Pro vs Opus 4.8's 69.2% (released four days earlier), 66.0% Terminal-Bench 2.1, 70.06% OSWorld-Verified, 74.2% MCP Atlas, 83.5 BrowseComp. $0.60/$2.40 standard ($0.30/$1.20 promotional at launch).
75 models across 16 providers

May 28, 2026

Claude Opus 4.8, Grok Build 0.1, Cohere Command A+, Qwen3.7-Max GA

Added Claude Opus 4.8 (May 28) - 88.6% SWE-Verified, 69.2% SWE-Pro, 93.6% GPQA Diamond, 83.4% OSWorld-Verified, 84% Online-Mind2Web. 1M context, $5/$25 standard, new Fast Mode at $10/$50 (3x cheaper than 4.7 Fast Mode). All benchmarks verified verbatim against the official Opus 4.8 System Card PDF.
Added Grok Build 0.1 (May 20) - first xAI model purpose-built for agentic coding. 256K context, text+image, $1/$2 per 1M tokens. 88.9% PinchBench.
Added Cohere Command A+ (May 20) - 218B sparse MoE / 25B active, Apache 2.0, multimodal (text+image+tool use). 128K input / 64K output. 75.1% MMMU, 80.6% MathVista, 37 Intelligence Index. Runs on two H100s at W4A4.
Updated Qwen3.7-Max (May 20) - graduated from preview at Alibaba Cloud Summit. 92.4% GPQA Diamond, $2.50/$7.50, cached input $0.25. Lowest hallucination rate among frontier models (22.9%).
72 models across 16 providers

April 25, 2026

GPT-5.5, DeepSeek V4, Tencent Hy3

Added GPT-5.5 (April 23) - first fully retrained base since GPT-4.5, 1M context, multimodal, $5/$30
Added GPT-5.5 Pro (April 24) - higher-accuracy variant, $30/$180
Added DeepSeek V4 Pro (April 24) - 1.6T/49B active, 90.1% GPQA Diamond, 80.6% SWE-bench, MIT license
Added DeepSeek V4 Flash (April 24) - 284B/13B active, 88.1% GPQA Diamond, $0.14/$0.28, MIT license
Added Tencent Hy3 Preview (April 23) - 295B/21B active, 87.2% GPQA Diamond, 74.4% SWE-bench, new provider
DeepSeek V4 and Hy3 benchmarks verified against Hugging Face model cards
GPT-5.5 benchmarks set to null pending primary-source verification (the OpenAI blog was inaccessible)
66 models across 16 providers

April 23, 2026

7 new models, MCP server improvements

Added Claude Opus 4.7 (April 16) - 87.6% SWE-bench Verified, 94.2% GPQA Diamond, xhigh effort level, 3.75MP vision
Added GPT-Rosalind (April 16) - first domain-specific OpenAI model for life sciences
Added Grok 4.3 Beta (April 17) - 2M context, native video understanding, SuperGrok Heavy only ($300/mo)
Added Qwen3.6-Max-Preview (April 20) - top scores on 6 coding benchmarks, free preview on Bailian
Added Qwen3.6-27B (April 22) - dense 27B open-weight vision model, 87.8% GPQA Diamond, 77.2% SWE-bench, 83.9% LiveCodeBench v6, Apache 2.0
Added GPT-5.3-Codex (Feb 24) - agentic coding model, SOTA on SWE-Bench Pro, 400K context, $1.75/$14
Added Gemini 3.1 Flash Lite (March 3) - most cost-efficient Google model, 1M context, $0.25/$1.50 per 1M tokens
MCP server: improved tool descriptions with attribution URLs, reduced default limit from 50 to 25, disambiguated get/search tool pairs, added freshness cadence info
60 models across 15 providers

April 17, 2026

Qwen3.6-Plus and Qwen3.6 35B-A3B

Added Qwen3.6-Plus (April 2, 2026) - proprietary flagship with 1M native context, 78.8% SWE-bench Verified
Added Qwen3.6 35B-A3B (April 15, 2026) - open-weight MoE with 3B active / 35B total params, runs on laptops, Apache 2.0
44 models across 15 providers

April 12, 2026

Meta Muse Spark, Cohere Command A, nav restructure, production prep

Added Meta Muse Spark (April 8, 2026) - first model from Meta Superintelligence Labs
Added Cohere Command A (March 13, 2025) - 111B enterprise RAG model
All provider logos now local (Anthropic, Google, Meta, Qwen, AWS, Cohere, IBM)
Download weight icons from LobeHub (HuggingFace, Ollama dark mode)
Resources nav restructured as mega menu with Research & Data section
/research/ landing page created
Newsletter forms fixed (Pipedrive marketing_status, correct worker URL)
Subscribers differentiated: General Newsletter vs AI Model Tracker Updates
Exact release dates for all 42 models (verified against official sources)
Full site audit: 407 pages, 100% schema coverage, AEO avg 72
42 models across 14 providers

April 11, 2026

Selectable benchmark columns, new subpages, structural overhaul

10 selectable columns now available: 8 benchmarks (MMLU-Pro, GPQA Diamond, SWE-bench, HumanEval, LiveCodeBench, MATH, AIME, HLE) plus input and output pricing ($/M In, $/M Out)
Column picker - choose up to 5 visible columns, persists in localStorage
Default sort changed to GPQA Diamond
Cost calculator page
JSON API endpoint (CC BY-NC 4.0)
Releases timeline page with RSS feed
Newsletter signup for tracker updates
Sticky filter bar and table headers
Hero redesigned with card links to subpages
All URLs moved to /research/demandsphere-radar/ai-frontier-model-tracker/
Proper breadcrumbs on all pages
Added to nav menu and resource center
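The cost calculator page is listed here only as a feature. As a rough sketch, assuming it multiplies token counts by the per-1M-token list prices shown in the $/M In and $/M Out columns, a per-request estimate could look like the following. The `Pricing` class, `request_cost` function, and flat `discount` parameter are hypothetical illustrations, not the page's actual code; the example prices are the Claude Opus 5 and Fable 5 list prices from the entries above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """List prices in USD per 1M tokens (the tracker's $/M In and $/M Out columns)."""
    input_per_m: float
    output_per_m: float


def request_cost(pricing: Pricing, input_tokens: int, output_tokens: int,
                 discount: float = 0.0) -> float:
    """Estimate the USD cost of one request.

    `discount` is a hypothetical flat fraction (e.g. 0.5 for a 50% batch
    discount) applied to both input and output.
    """
    base = (input_tokens * pricing.input_per_m
            + output_tokens * pricing.output_per_m) / 1_000_000
    return base * (1.0 - discount)


opus_5 = Pricing(5.0, 25.0)    # Claude Opus 5 list price
fable_5 = Pricing(10.0, 50.0)  # Claude Fable 5 list price

# 200K input tokens and 20K output tokens in one request.
print(request_cost(opus_5, 200_000, 20_000))                 # 1.5
print(request_cost(fable_5, 200_000, 20_000))                # 3.0
print(request_cost(fable_5, 200_000, 20_000, discount=0.5))  # 1.5
```

The 50% discount case matches Fable 5's listed batch price of $5/$25, which is also Opus 5's standard price. The sketch ignores cached-input pricing, fast modes, and tokenizer differences, any of which can change a real bill.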

April 10, 2026

Data corrections and new models

Full data audit - corrected 30+ benchmark scores, 6 release dates, 3 context windows, 4 pricing values
Added Gemma 4 31B, Gemma 4 26B-A4B, Qwen3.5 397B-A17B, GPT-5.4 Mini
Removed legacy models (GPT-4o, o1, GPT-5.2, Claude 3.5 Sonnet)
Citations tab added with per-model source links
Recent Releases feed section
Provider count now dynamic
Fixed license claims (Kimi K2 Apache->Modified MIT, Hermes 405B Apache->Llama 3.1)
Set unpublished scores to null (Llama 4 HumanEval/MATH, DeepSeek HumanEval)

April 9, 2026

Initial launch

40 frontier models across 14 providers
Sortable table with MMLU-Pro, HumanEval, MATH benchmarks
Expandable rows with Detail, Benchmarks chart, News feed, Download Weights
Filters: type, access, SOTA, multimodal
Provider logos for OpenAI, xAI, DeepSeek, Moonshot, MiniMax, Mistral, Microsoft, Nous
FAQ schema, OG image