Claude Sonnet 4.6, open-weight Qwen3.5-397B, Google launches Lyria 3

Anthropic hits hard with Claude Sonnet 4.6, a model that rivals Opus on many tasks at a Sonnet price. Meanwhile, Qwen releases Qwen3.5-397B-A17B, the first open-weight model of its Qwen3.5 series, with 397 billion parameters, and Google integrates Lyria 3, its music generation model, directly into Gemini.

Claude Sonnet 4.6: Opus performance at Sonnet price

February 17: Anthropic launches Claude Sonnet 4.6, described as the most capable Sonnet to date. The model is a comprehensive upgrade across coding, computer use, long-context reasoning, agent planning, knowledge work, and design. It ships with a 1 million token context window in beta.

The positioning is clear: performance that would previously have required an Opus model is now available at the Sonnet rate of $3/$15 per million input/output tokens (unchanged from Sonnet 4.5). Sonnet 4.6 becomes the default model on Free and Pro plans in claude.ai and Claude Cowork.
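To make the rate concrete, here is a minimal sketch of the arithmetic. It assumes the first figure applies to input tokens and the second to output tokens, and it ignores prompt caching, batch discounts, and any long-context pricing tiers. The helper name `estimate_cost_usd` is hypothetical and not part of any SDK.

```python
# Rates quoted in the article, in USD per million tokens.
# Assumption: $3 applies to input tokens and $15 to output tokens.
INPUT_USD_PER_MTOK = 3.00
OUTPUT_USD_PER_MTOK = 15.00


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one request (hypothetical helper, not an SDK call)."""
    total = input_tokens * INPUT_USD_PER_MTOK + output_tokens * OUTPUT_USD_PER_MTOK
    return total / 1_000_000


# Example: a 200,000-token prompt that produces a 4,000-token reply.
cost = estimate_cost_usd(200_000, 4_000)
print(f"${cost:.2f}")  # prints $0.66
```

In this example the long prompt accounts for $0.60 of the $0.66, so input volume rather than output volume drives the bill.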

Benchmarks and user feedback

In Claude Code, testers preferred Sonnet 4.6 to Sonnet 4.5 about 70% of the time, reporting that it reads context more carefully before modifying code and consolidates shared logic instead of duplicating it. Even more notable: users preferred Sonnet 4.6 to Opus 4.5 (the frontier model of November 2025) 59% of the time, citing less over-engineering, less “laziness,” and better instruction following.

Benchmark	Score
SWE-bench Verified	80.2% (with prompt modification)
OSWorld (computer use)	Major progress over 16 months
OfficeQA	Equals Opus 4.6
Vending-Bench Arena	Emerging investment/pivot strategy

Computer use progresses significantly, and Sonnet 4.6 is also more resistant to prompt injection than Sonnet 4.5, reaching a level comparable to Opus 4.6.

Associated product updates

The announcement comes with several general availability releases on the Claude API: code execution, memory, programmatic tool calling, tool search, and tool use examples. The web search and web fetch tools now include dynamic filtering: Claude automatically writes and executes code to filter search results, keeping only relevant content in context.
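The idea behind dynamic filtering can be illustrated with a small sketch. This is not Anthropic's implementation. It only shows the kind of filtering code a model might write for itself, here a simple keyword-overlap filter. The function name `filter_relevant`, the `min_hits` parameter, and the sample page are all hypothetical.

```python
import re


def filter_relevant(paragraphs: list[str], query: str, min_hits: int = 1) -> list[str]:
    """Keep paragraphs that mention at least `min_hits` distinct query terms."""
    # Lowercase the query and ignore very short terms such as "a" or "of".
    terms = {t for t in re.findall(r"[a-z0-9]+", query.lower()) if len(t) > 2}
    kept = []
    for paragraph in paragraphs:
        words = set(re.findall(r"[a-z0-9]+", paragraph.lower()))
        if len(terms & words) >= min_hits:
            kept.append(paragraph)
    return kept


# Hypothetical fetched page, already split into paragraphs.
page = [
    "Cookie settings and newsletter signup.",
    "Sonnet 4.6 ships with a 1 million token context window in beta.",
    "Related articles you may like.",
]
relevant = filter_relevant(page, "Sonnet context window")
print(relevant)  # only the second paragraph survives
```

Only the surviving paragraphs would be placed in the model's context, which is where the token savings come from.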

🔗 Improved web search with dynamic filtering

For Claude in Excel users, the add-in now supports MCP connectors (S&P Global, LSEG, Daloopa, PitchBook, Moody’s, FactSet), available on Pro, Max, Team, and Enterprise plans.

🔗 Official announcement

Anthropic measures AI agent autonomy in real conditions

February 18 — Anthropic publishes a study analyzing millions of human-agent interactions across Claude Code and the public API, with one goal: to understand how humans handle agent autonomy in practice.

Key results

Metric	Value
Maximum autonomous duration (99.9th percentile)	~45 minutes (doubled in 3 months)
Auto-approve rate (experienced users)	40%+ (vs. 20% for new users)
Share of software engineering in API traffic	~50%
Actions with guardrails	80%
Actions with human in the loop	73%
Irreversible actions	0.8%

A counter-intuitive finding: experienced users increase both the auto-approve rate AND the interruption rate. They move from action-by-action supervision to active monitoring with targeted intervention. Moreover, Claude stops to ask for clarifications more often than humans interrupt it, particularly on complex tasks.

The study concludes that there is a significant gap between capability and usage: the autonomy that models can handle far exceeds what they are granted in practice, a phenomenon the researchers call “undeployed autonomy surplus.”

🔗 Full study

Anthropic: Rwanda and Infosys partnerships

February 17 — Alongside the Sonnet 4.6 launch, Anthropic signs a memorandum of understanding with the government of Rwanda to deploy Claude in healthcare, education, and public administration sectors. The partnership, led with the Ministry of ICT and Innovation, includes training civil servants and deploying an AI learning companion in eight African countries.

Anthropic also announces a collaboration with Infosys to build AI agents for telecommunications and other regulated industries.

🔗 Rwanda Partnership

Qwen3.5-397B-A17B: first open-weight model of the 3.5 series

February 16: Alibaba Qwen releases Qwen3.5-397B-A17B, the first open-weight model of the Qwen3.5 series. Its main advance is a hybrid architecture that combines linear attention with a sparse Mixture-of-Experts (MoE).

Feature	Details
Total parameters	397B (hybrid MoE architecture)
Architecture	Hybrid linear attention + sparse MoE
Throughput	8.6x to 19.0x that of Qwen3-Max
Languages	201 languages and dialects
License	Apache 2.0
Training	Large-scale reinforcement learning
Specialty	Natively multimodal, built for real-world agents

The model is available immediately on Hugging Face, ModelScope, Alibaba Cloud Model Studio, and via Qwen Code. With 201 supported languages and an Apache 2.0 license, it is one of the most ambitious open-weight models currently available in terms of linguistic coverage and inference throughput.
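For readers unfamiliar with sparse MoE, the sketch below shows generic top-k expert routing in pure Python. It is not Qwen's router. Normalizing the softmax over only the selected experts is one common variant chosen here for illustration, and the toy experts, logits, and function names are hypothetical. The general point is that only a few experts run per token, which is how a model with 397B total parameters can activate far fewer per step (the "A17B" suffix is conventionally read as roughly 17B active parameters).

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    order = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    # Assumption: weights are normalized over the selected experts only.
    weights = softmax([router_logits[i] for i in top])
    return list(zip(top, weights))


def moe_forward(x, experts, router_logits, k=2):
    """Run only the selected experts and mix their outputs by router weight."""
    return sum(weight * experts[i](x) for i, weight in route_top_k(router_logits, k))


# Four toy "experts" that simply scale their input (hypothetical).
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
logits = [0.1, 2.0, -1.0, 2.0]  # hypothetical router scores for one token

print(route_top_k(logits, k=2))              # [(1, 0.5), (3, 0.5)]
print(moe_forward(1.0, experts, logits, 2))  # 3.0
```

Experts 0 and 2 are never evaluated for this token, so their parameters cost nothing at inference time for this step.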

🔗 Tweet @Alibaba_Qwen

Google Lyria 3: music generation arrives in Gemini

February 18: Google and DeepMind present Lyria 3, an AI music generation model integrated directly into the Gemini application. Users can create 30-second music tracks from text prompts, photos, or videos, with custom generated lyrics.

Feature	Details
Inputs	Text, images, videos
Output	30-second audio tracks
Customization	Varied musical styles, generated lyrics
Availability	Beta in Gemini (users aged 18+)

Lyria 3 shows notable flexibility in combining instruments and genres, producing anything from jingles to lo-fi compositions. The global rollout is gradual.

🔗 Tweet @GoogleAI

OpenAI EVMbench: security benchmark for smart contracts

February 18: OpenAI and Paradigm launch EVMbench, a benchmark that evaluates the ability of AI agents to detect, fix, and exploit vulnerabilities in Ethereum smart contracts. The benchmark draws on 120 curated vulnerabilities from 40 audits (mainly Code4rena competitions).

Mode	Description	GPT-5.3-Codex	GPT-5 (6 months earlier)
Exploit	Execute fund-draining attacks	72.2%	31.9%
Detect	Audit and detect vulnerabilities	Below full coverage	-
Patch	Fix while preserving functionality	Below full coverage	-

An interesting finding: AI agents perform better at exploitation, where the objective is explicit, than at detection and patching, where they often stop after finding the first vulnerability. OpenAI reaffirms its commitment of $10M in API credits for defensive cybersecurity.

🔗 EVMbench Announcement

GLM-5 Technical Report: Z.ai documents its model

February 18: Z.ai publishes the full GLM-5 technical report, detailing the architectural innovations of the model launched on February 11 (744B parameters, 40B active, MIT License).

The report documents three key innovations: Dynamic Sparse Attention (DSA) to reduce training and inference costs, an asynchronous RL infrastructure that decouples generation from training, and RL algorithms for agents that support complex, long-horizon interactions. The report is available on arXiv.
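The general pattern of decoupling generation from training can be sketched with a producer thread and a bounded queue. This is a toy illustration of the idea only, not Z.ai's infrastructure: the rollout contents, the function names, and the batch size are hypothetical, and a real system would update model weights where this sketch merely collects batches.

```python
import queue
import threading


def generate_rollouts(out_q, num_rollouts):
    """Producer: emit rollouts without waiting for training steps to finish."""
    for i in range(num_rollouts):
        out_q.put({"id": i, "reward": float(i % 2)})  # hypothetical rollout record
    out_q.put(None)  # sentinel: no more rollouts


def train(in_q, batch_size):
    """Consumer: group rollouts into batches as they arrive."""
    batches, batch = [], []
    while (item := in_q.get()) is not None:
        batch.append(item)
        if len(batch) == batch_size:
            batches.append(batch)  # a real trainer would run a gradient step here
            batch = []
    if batch:
        batches.append(batch)  # final partial batch
    return batches


# The bounded queue lets generation run ahead of training, but only by 8 items.
rollout_q = queue.Queue(maxsize=8)
producer = threading.Thread(target=generate_rollouts, args=(rollout_q, 10))
producer.start()
batches = train(rollout_q, batch_size=4)
producer.join()

print([len(b) for b in batches])  # [4, 4, 2]
```

The queue size is the main design knob: a larger buffer keeps generators busier but means the trainer sees rollouts produced by older versions of the policy.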

🔗 Tweet @Zai_org · 🔗 arXiv

Cohere Labs Tiny Aya: ultra-compact multilingual AI

February 17 — Cohere Labs presents Tiny Aya, a family of small language models supporting 70+ languages with only 3.35 billion parameters. The goal: to make multilingual AI accessible everywhere, including on phones and offline.

Tiny Aya targets three audiences: researchers working in non-English languages, developers building for digitally underserved communities, and embedded applications requiring reliable translation without cloud dependency. The model includes an offline translation capability, improving privacy and reducing latency.

🔗 Tweet @cohere

Runway Gen-4.5 available via API + Claude Code Skill

February 17: Runway opens access to Gen-4.5 via its API, allowing developers to integrate image, video, and audio generation directly into their projects. The announcement comes with a dedicated Claude Code Skill, available on GitHub, which lets developers generate Runway multimedia content without leaving their development environment.

🔗 Tweet @runwayml · 🔗 GitHub Skills

Manus Agents: personal agent with long-term memory

February 16: Manus launches Manus Agents, a feature that gives each user a personal agent directly in chat conversations. The agent combines long-term memory (retained style, tone, and preferences), full creation capabilities (videos, slides, sites, images), and direct integrations with Gmail, Calendar, and Notion.

🔗 Tweet @ManusAI

ElevenAgents for Support

February 17: ElevenLabs launches ElevenAgents for Support, conversational AI agents for customer support. Operating across voice and digital channels in over 70 languages, these agents run on the ElevenLabs agentic platform, which counts 4M+ deployments in production.

🔗 ElevenLabs Agents

NotebookLM x Zillow: real estate notebook

February 18: In partnership with Zillow, NotebookLM launches a free Featured Notebook for home buyers, centralizing expert advice on financial preparation, market assessment, and buying procedures.

🔗 Tweet @NotebookLM

What this means

This week illustrates two major trends. The first is the democratization of frontier performance: Sonnet 4.6 brings Opus-level capabilities at the unchanged Sonnet rate, while Qwen3.5 makes a 397B-parameter model available under Apache 2.0. The second is the expansion of AI agents into new areas: the Anthropic study shows that the longest autonomous sessions have doubled in length in three months, and players like Manus, ElevenLabs, and Runway are building specialized agents (personal chat, customer support, multimedia creation).

The arrival of music generation in Gemini with Lyria 3 and the EVMbench benchmark for blockchain security also show that generative AI and security AI continue to develop as distinct fields.