AW · AI Watchtower

🔴 High Significance

Model Releases

🔴 Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding — score 95 Sources: huggingface

With the rapid advancement of video understanding, existing benchmarks are becoming increasingly saturated, exposing a critical discrepancy between inflated leaderboard scores and real-world model capabilities. To address this widening gap, we introduce Video-MME-v2, a comprehensive benchmark design

Developer Tools

🔴 Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents — score 85 Sources: huggingface

Large language models are increasingly deployed as autonomous agents executing multi-step workflows in real-world software environments. However, existing agent benchmarks suffer from three critical limitations: (1) trajectory-opaque grading that checks only final outputs, (2) underspecified safety

🔴 Learning to Retrieve from Agent Trajectories — score 75 Sources: huggingface

Information retrieval (IR) systems have traditionally been designed and trained for human users, with learning-to-rank methods relying heavily on large-scale human interaction logs such as clicks and dwell time. With the rapid emergence of large language model (LLM) powered search agents, however, r

🟡 Notable

Model Releases

🟡 An update on recent Claude Code quality reports — score 50 Sources: lab_blog/Anthropic

We traced recent reports of Claude Code quality issues to three separate changes. Here's what happened and what we're changing.

🟡 Scaling Managed Agents: Decoupling the brain from the hands Apr 08, 2026 — score 50 Sources: lab_blog/Anthropic

🟡 Claude Code auto mode: a safer way to skip permissions Mar 25, 2026 — score 50 Sources: lab_blog/Anthropic

🟡 Harness design for long-running application development Mar 24, 2026 — score 50 Sources: lab_blog/Anthropic

🟡 Eval awareness in Claude Opus 4.6’s BrowseComp performance Mar 06, 2026 — score 50 Sources: lab_blog/Anthropic

Omitted 22 additional model release items from the main section; see the raw data and source-specific sections below.

Developer Tools

🟡 ACES: Who Tests the Tests? Leave-One-Out AUC Consistency for Code Generation — score 65 Sources: huggingface

Selecting LLM-generated code candidates using LLM-generated tests is challenging because the tests themselves may be incorrect. Existing methods either treat all tests equally or rely on ad-hoc heuristics to filter unreliable tests. Yet determining test correctness requires knowing which codes are c

🟡 Vanast: Virtual Try-On with Human Image Animation via Synthetic Triplet Supervision — score 55 Sources: huggingface

We present Vanast, a unified framework that generates garment-transferred human animation videos directly from a single human image, garment images, and a pose guidance video. Conventional two-stage pipelines treat image-based virtual try-on and pose-driven animation as separate processes, which oft

🟢 Incremental

Model Releases

🟢 ThinkTwice: Jointly Optimizing Large Language Models for Reasoning and Self-Refinement — score 20 Sources: huggingface

We introduce ThinkTwice, a simple two-phase framework that jointly optimizes LLMs to solve reasoning problems and refine the answers, based on Group Relative Policy Optimization (GRPO). In each pair of training steps, ThinkTwice first optimizes the model on solving reasoning problems, then optimizes

🟢 How Well Do Agentic Skills Work in the Wild: Benchmarking LLM Skill Usage in Realistic Settings — score 5 Sources: huggingface

Agent skills, which are reusable, domain-specific knowledge artifacts, have become a popular mechanism for extending LLM-based agents, yet formally benchmarking skill usage performance remains scarce. Existing skill benchmarking efforts focus on overly idealized conditions, where LLMs are directly p

Developer Tools

🟢 Beyond Accuracy: Unveiling Inefficiency Patterns in Tool-Integrated Reasoning — score 20 Sources: huggingface

In real-world Tool-Integrated Reasoning (TIR) scenarios, where LLMs interleave reasoning with external tool calls, a major source of inefficiency is that the tool calls create pauses between LLM requests and cause KV-Cache eviction, forcing recomputation. Also, the long, unfiltered response returned

Infrastructure & Compute

🟢 MegaTrain: Full Precision Training of 100B+ Parameter Large Language Models on a Single GPU — score 35 Sources: huggingface

We present MegaTrain, a memory-centric system that efficiently trains 100B+ parameter large language models at full precision on a single GPU. Unlike traditional GPU-centric systems, MegaTrain stores parameters and optimizer states in host memory (CPU memory) and treats GPUs as transient compute eng

📄 New Papers

Title	Category	Score
Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding	model_release	243
Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents	developer_tool	124
Learning to Retrieve from Agent Trajectories	developer_tool	78
ACES: Who Tests the Tests? Leave-One-Out AUC Consistency for Code Generation	developer_tool	57
Vanast: Virtual Try-On with Human Image Animation via Synthetic Triplet Supervision	developer_tool	51
EchoChain: A Full-Duplex Benchmark for State-Update Reasoning Under Interruptions	cs.AI	0
SkillSieve: A Hierarchical Triage Framework for Detecting Malicious AI Agent Skills	cs.AI	0
On Emotion-Sensitive Decision Making of Small Language Model Agents	cs.AI	0
AI-Driven Research for Databases	cs.AI	0
LLM-based Schema-Guided Extraction and Validation of Missing-Person Intelligence from Heterogeneous Data Sources	cs.AI	0
Orthogonal Quadratic Complements for Vision Transformer Feed-Forward Networks	cs.AI	0
Latent Structure of Affective Representations in Large Language Models	cs.AI	0
Scientific Knowledge-driven Decoding Constraints Improving the Reliability of LLMs	cs.AI	0
TwinLoop: Simulation-in-the-Loop Digital Twins for Online Multi-Agent Reinforcement Learning	cs.AI	0
The Detection-Extraction Gap: Models Know the Answer Before They Can Say It	cs.AI	0

🏢 Lab Blog Posts

Anthropic: An update on recent Claude Code quality reports. We traced recent reports of Claude Code quality issues to three separate changes. Here's what happened and what we're changing.
Anthropic: Scaling Managed Agents: Decoupling the brain from the hands Apr 08, 2026
Anthropic: Claude Code auto mode: a safer way to skip permissions Mar 25, 2026
Anthropic: Harness design for long-running application development Mar 24, 2026
Anthropic: Eval awareness in Claude Opus 4.6’s BrowseComp performance Mar 06, 2026
Anthropic: Quantifying infrastructure noise in agentic coding evals Feb 05, 2026
Anthropic: Building a C compiler with a team of parallel Claudes Feb 05, 2026
Anthropic: Designing AI-resistant technical evaluations Jan 21, 2026
Anthropic: Demystifying evals for AI agents Jan 09, 2026
Anthropic: Effective harnesses for long-running agents Nov 26, 2025
Anthropic: Introducing advanced tool use on the Claude Developer Platform Nov 24, 2025
Anthropic: Code execution with MCP: Building more efficient agents Nov 04, 2025
Anthropic: Beyond permission prompts: making Claude Code more secure and autonomous Oct 20, 2025
Anthropic: Equipping agents for the real world with Agent Skills Oct 16, 2025
Anthropic: Effective context engineering for AI agents Sep 29, 2025
Anthropic: A postmortem of three recent issues Sep 17, 2025
Anthropic: Writing effective tools for agents — with agents Sep 11, 2025
Anthropic: Desktop Extensions: One-click MCP server installation for Claude Desktop Jun 26, 2025
Anthropic: How we built our multi-agent research system Jun 13, 2025
Anthropic: Claude Code: Best practices for agentic coding Apr 18, 2025
Anthropic: The "think" tool: Enabling Claude to stop and think in complex tool use situations Mar 20, 2025
Anthropic: Raising the bar on SWE-bench Verified with Claude 3.5 Sonnet Jan 06, 2025
Anthropic: Building effective agents Dec 19, 2024
Anthropic: Introducing Contextual Retrieval Sep 19, 2024
OpenAI: The next phase of enterprise AI
OpenAI: Introducing the Child Safety Blueprint

AI Watchtower Briefing — 2026-04-08
