
Brief summary: Anthropic has launched Claude Opus 4.7, its most capable model to date, which scores 64.3% on SWE-bench Pro against GPT-5.4’s 57.7%. The release adds multi-agent coordination, support for higher-resolution images, and more reliable multi-step reasoning with fewer tool errors. Priced at $5/$25 per million tokens, it is available across Claude plans and on platforms such as Amazon Bedrock and Microsoft Foundry.
Anthropic has released Claude Opus 4.7, its most advanced model yet, showing leading performance in software engineering and agentic reasoning, outpacing OpenAI’s GPT-5.4 and Google’s Gemini 3.1 Pro on key tasks for developers and enterprise users.
The launch comes amid significant commercial growth for Anthropic. The company is operating at a $30 billion annual revenue run rate, is reportedly attracting investment interest at a valuation of around $800 billion, and is in early IPO talks. Opus 4.7 is meant to justify those expectations by becoming the preferred choice for enterprises and developers, focused not just on benchmarks but on real-world applicability.
Areas of Excellence
The standout figures pertain to software engineering. On the SWE-bench Pro benchmark, testing a model’s problem-solving on real-world software issues, Opus 4.7 scores 64.3%, an improvement from Opus 4.6’s 53.4%, and ahead of GPT-5.4’s 57.7% and Gemini 3.1 Pro’s 54.2%. On SWE-bench Verified, it scores 87.6%, surpassing its predecessor’s 80.8% and Gemini 3.1 Pro’s 80.6%.
On CursorBench, which evaluates autonomous coding inside AI code editors, Opus 4.7 improves to 70% from Opus 4.6’s 58%. Since the model is already the default in Cursor and Claude Code, progress on a benchmark this close to everyday developer use is significant: Claude Code reached $2.5 billion in annual revenue in February, and AI-assisted coding is one of the fastest-growing software sectors.
In graduate-level reasoning measured by GPQA Diamond, results show convergence: Opus 4.7 at 94.2%, GPT-5.4 Pro at 94.4%, and Gemini 3.1 Pro at 94.3%. These differences are minimal, indicating that competitive differentiation is moving from reasoning scores to applied performance on complex tasks.
Enhancements in Agentic Reasoning
Opus 4.7’s most important improvements may not show up in any single benchmark. Anthropic claims a 14% gain over Opus 4.6 on complex multi-step workflows, achieved with fewer tokens and a one-third reduction in tool errors. It is also the first Claude model to pass “implicit-need tests”, which require the model to infer which tools or actions a task demands without explicit instructions.
The model also adds multi-agent coordination, orchestrating parallel AI workstreams instead of sequential task processing. For enterprises using Claude for tasks like code review and data processing, this boosts throughput. Anthropic asserts Opus 4.7 can maintain focus during extended workflows, addressing common issues with model coherence over long tasks.
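As an illustration of what parallel workstreams can look like from the client side, here is a minimal sketch using the Anthropic Python SDK’s async client and `asyncio` to fan out independent subtasks concurrently. The model identifier `claude-opus-4-7` and the example tasks are assumptions, not confirmed names, and the orchestration shown is ordinary client code, not a new Anthropic API.

```python
import asyncio

from anthropic import AsyncAnthropic  # standard Anthropic Python SDK

# Hypothetical model identifier; check Anthropic's model list for the real one.
MODEL = "claude-opus-4-7"

client = AsyncAnthropic()  # reads ANTHROPIC_API_KEY from the environment


async def run_workstream(task: str) -> str:
    """Run one independent workstream as its own model call."""
    response = await client.messages.create(
        model=MODEL,
        max_tokens=1024,
        messages=[{"role": "user", "content": task}],
    )
    return response.content[0].text


async def main() -> None:
    # Fan out independent subtasks concurrently instead of sequentially.
    tasks = [
        "Review this module for security issues: ...",
        "Summarise the failing CI logs: ...",
        "Draft release notes for the pending changes: ...",
    ]
    results = await asyncio.gather(*(run_workstream(t) for t in tasks))
    for task, result in zip(tasks, results):
        print(task[:40], "->", result[:80])


asyncio.run(main())
```

The point is simply that independent subtasks no longer need to queue behind one another; coordinating genuinely interdependent agents would need more machinery than this sketch shows.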
Resilience is another focus: the model keeps working through tool failures that would halt Opus 4.6, adapting rather than stopping. That matters for automated pipelines, where a single failed call can derail an entire run.
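A sketch of the failure-handling pattern this enables, using the Messages API’s standard tool-use loop: when a tool call throws, the error is reported back to the model as a `tool_result` with `is_error` set, rather than aborting the run. The model name, the `query_db` tool, and the simulated outage are all hypothetical.

```python
from anthropic import Anthropic  # standard Anthropic Python SDK

client = Anthropic()
MODEL = "claude-opus-4-7"  # hypothetical identifier

# Illustrative tool; the name, schema, and flaky behaviour are made up.
TOOLS = [{
    "name": "query_db",
    "description": "Run a read-only SQL query against the analytics database.",
    "input_schema": {
        "type": "object",
        "properties": {"sql": {"type": "string"}},
        "required": ["sql"],
    },
}]


def query_db(sql: str) -> str:
    raise TimeoutError("analytics database unreachable")  # simulated outage


messages = [{"role": "user", "content": "How many signups did we get last week?"}]

for _ in range(5):  # bounded agent loop
    response = client.messages.create(
        model=MODEL, max_tokens=1024, tools=TOOLS, messages=messages,
    )
    messages.append({"role": "assistant", "content": response.content})
    if response.stop_reason != "tool_use":
        break  # the model produced a final answer

    results = []
    for block in response.content:
        if block.type != "tool_use":
            continue
        try:
            output = query_db(**block.input)
            results.append({"type": "tool_result",
                            "tool_use_id": block.id,
                            "content": output})
        except Exception as exc:
            # Surface the failure to the model instead of aborting the run;
            # it can retry, rephrase the query, or answer another way.
            results.append({"type": "tool_result",
                            "tool_use_id": block.id,
                            "content": str(exc),
                            "is_error": True})
    messages.append({"role": "user", "content": results})
```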
Visual and Contextual Upgrades
Opus 4.7 supports image resolutions up to 2,576 pixels on the long side, more than triple the limit of its predecessors, targeting enterprise document analysis where fine detail matters. The context window stays at one million tokens, half of Gemini 3.1 Pro’s capacity but adequate for most enterprise needs. On long-context research benchmarks, Opus 4.7 matches the top overall score of 0.715 and delivers the most consistent performance.
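For document-analysis workloads, the higher ceiling means a full-resolution scan can be passed directly as an image block in the Messages API. A minimal sketch, assuming the hypothetical `claude-opus-4-7` identifier and an illustrative `invoice_scan.png` file:

```python
import base64

from anthropic import Anthropic

client = Anthropic()

# Illustrative: a scan up to ~2,576 px on the long side can now be
# sent without downscaling, per the resolution limit described above.
with open("invoice_scan.png", "rb") as f:  # hypothetical file
    image_data = base64.b64encode(f.read()).decode()

response = client.messages.create(
    model="claude-opus-4-7",  # hypothetical identifier
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image",
             "source": {"type": "base64",
                        "media_type": "image/png",
                        "data": image_data}},
            {"type": "text",
             "text": "Extract the line items and totals from this invoice."},
        ],
    }],
)
print(response.content[0].text)
```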
Anthropic notes the model follows instructions more literally than before, which may require users to adjust existing prompts. The stricter adherence removes ambiguity and limits unexpected outputs, reducing off-task behaviour and hallucinations in enterprise settings.
Cost and Availability
Opus 4.7 is now available for Claude Pro and other paid Claude plans, priced at $5 per million input tokens and $25 per million output tokens. It can also be accessed through platforms such as Amazon Bedrock and Microsoft Foundry.