Anthropic Launches Claude Sonnet 5, Bridging the Gap to Opus Series

Anthropic has officially launched Claude Sonnet 5. According to the corrected BrowseComp benchmark data, Sonnet 5 performs roughly on par with the flagship Opus 4.8, widening the gap with its predecessor, Sonnet 4.6. However, a key question remains: does the model remain cost-effective if it burns through more tokens to complete these complex tasks?

Anthropic pitches Sonnet 5 as its most 'agentic' Sonnet model yet. It can construct long-term plans, use tools such as web browsers and terminal shells, and execute autonomous workflows. Only a few months ago, this level of autonomous execution was possible only in larger, premium-tier models. Sonnet 5 aims to bring these agentic capabilities to a lower, more competitive price point.

The published benchmarks show Sonnet 5 with a clear lead over Sonnet 4.6 and, on some tests, near-parity with Opus 4.8. On the coding-centric SWE-bench Pro, Sonnet 5 reached 63.2%, up from Sonnet 4.6's 58.1% but still behind Opus 4.8 at 69.2%. On Terminal-Bench 2.1, its score rose to 80.4% from 67.0%. For multidisciplinary reasoning on Humanity's Last Exam with tools, Sonnet 5 scored 57.4%, virtually matching Opus 4.8's 57.9%. Additionally, in OS orchestration tests on OSWorld-Verified, Sonnet 5 achieved 81.2% compared to its predecessor's 78.5%.
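To make the size of these gaps easier to compare, the short Python sketch below tabulates the scores quoted above and computes the percentage-point difference between Sonnet 5 and each other model. The dictionary layout and the `gap` helper are illustrative names, not part of any published tooling. The sketch assumes the 67.0% Terminal-Bench 2.1 figure belongs to Sonnet 4.6, and it leaves a score as `None` wherever the article gives no number.

```python
# Scores in percent, copied from the figures quoted in the article.
# None marks a model/benchmark pair for which the article gives no number.
# Assumption: the 67.0% Terminal-Bench 2.1 figure is Sonnet 4.6's score.
SCORES = {
    "SWE-bench Pro": {"sonnet_5": 63.2, "sonnet_4_6": 58.1, "opus_4_8": 69.2},
    "Terminal-Bench 2.1": {"sonnet_5": 80.4, "sonnet_4_6": 67.0, "opus_4_8": None},
    "Humanity's Last Exam (with tools)": {"sonnet_5": 57.4, "sonnet_4_6": None, "opus_4_8": 57.9},
    "OSWorld-Verified": {"sonnet_5": 81.2, "sonnet_4_6": 78.5, "opus_4_8": None},
}


def gap(benchmark, other):
    """Return Sonnet 5's score minus `other`'s score, in percentage points.

    Returns None when the article does not report a score for `other`.
    """
    row = SCORES[benchmark]
    if row[other] is None:
        return None
    return round(row["sonnet_5"] - row[other], 1)


if __name__ == "__main__":
    for name in SCORES:
        print(name, "| vs Sonnet 4.6:", gap(name, "sonnet_4_6"),
              "| vs Opus 4.8:", gap(name, "opus_4_8"))
```

Read this way, the figures show Sonnet 5 within half a point of Opus 4.8 on Humanity's Last Exam but six points behind it on SWE-bench Pro, so "near-parity" applies to some benchmarks rather than all of them.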

Most notably, on the GDPval-AA v2 benchmark, which evaluates real-world knowledge work, Sonnet 5 edged past the larger Opus 4.8 with a score of 1,618 to 1,615. Early-access developers have corroborated this marked improvement in agentic behavior. At the same time, Anthropic faces regulatory headwinds: national security concerns previously restricted the shipping of its highly capable Mythos 5 and Fable 5 models. Sonnet 5 was therefore intentionally not trained on offensive cyber tasks. Its exploit generation capabilities remain significantly lower than those of Mythos 5; in the Firefox 147 exploit evaluation, for example, it showed only a 13.2% partial control rate. This should make regulatory compliance smoother.

[AgentUpdate Depth Analysis] The launch of Claude Sonnet 5 underscores a pivotal shift in the AI agent ecosystem toward 'cost-efficient autonomy.' By prioritizing tool use, terminal interaction, and multi-step planning over pure parameter scaling, Anthropic has engineered a mid-tier model that challenges top-tier models on agentic benchmarks such as SWE-bench Pro and OSWorld. This approach lowers the barrier to entry for building end-to-end agentic workflows and benefits tools like Cursor, Devin, and enterprise-grade automation agents. Compared with competitors that focus heavily on multi-hop logical reasoning (like OpenAI's o1 series), Anthropic is strengthening its position by optimizing 'Computer Use' and ecosystem-wide API orchestration. Ultimately, the long-term viability of agentic models hinges on the economics of inference. As Sonnet 5 moves into production environments, developers will closely monitor whether its agentic reliability justifies the token consumption that comes with it, making token efficiency the next critical frontier for agent design.
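One way to frame the token-economics question is cost per successfully completed task rather than cost per token. The Python sketch below is a minimal illustration under simplifying assumptions: a single blended price for all tokens, independent retries until a task succeeds, and a hypothetical helper named `cost_per_success`. The article reports no prices or token counts, so the numbers in the usage example are arbitrary placeholders in arbitrary units, not real figures for any model.

```python
def cost_per_success(tokens_per_attempt, price_per_million_tokens, success_rate):
    """Expected spend to obtain one successful task completion.

    Assumes one blended token price and independent retries, so the
    expected number of attempts per success is 1 / success_rate.
    """
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in the interval (0, 1]")
    cost_per_attempt = tokens_per_attempt / 1_000_000 * price_per_million_tokens
    return cost_per_attempt / success_rate


if __name__ == "__main__":
    # Placeholder values only (arbitrary price units), chosen to show the
    # trade-off: a cheaper model that uses more tokens per attempt versus
    # a pricier model that uses fewer tokens and succeeds more often.
    cheaper_but_chattier = cost_per_success(2_000_000, 1.0, 0.8)
    pricier_but_terser = cost_per_success(1_000_000, 4.0, 0.9)
    print(cheaper_but_chattier, pricier_but_terser)
```

Under this framing, a lower per-token price keeps its advantage only as long as extra token use and retries do not cancel it out, which is the trade-off developers would need to measure on their own workloads.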