🔴 High Significance
Model Releases
🔴 Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding (score 95)
Sources: huggingface
With the rapid advancement of video understanding, existing benchmarks are becoming increasingly saturated, exposing a critical discrepancy between inflated leaderboard scores and real-world model capabilities. To address this widening gap, we introduce Video-MME-v2, a comprehensive benchmark design
Developer Tools
🔴 Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents (score 85)
Sources: huggingface
Large language models are increasingly deployed as autonomous agents executing multi-step workflows in real-world software environments. However, existing agent benchmarks suffer from three critical limitations: (1) trajectory-opaque grading that checks only final outputs, (2) underspecified safety
🔴 Learning to Retrieve from Agent Trajectories (score 75)
Sources: huggingface
Information retrieval (IR) systems have traditionally been designed and trained for human users, with learning-to-rank methods relying heavily on large-scale human interaction logs such as clicks and dwell time. With the rapid emergence of large language model (LLM) powered search agents, however, r
🟡 Notable
Model Releases
🟡 Featured: An update on recent Claude Code quality reports. We traced recent reports of Claude Code quality issues to three separate changes. Here's what happened and what we're changing. (score 50)
Sources: lab_blog/Anthropic
🟡 Scaling Managed Agents: Decoupling the brain from the hands, Apr 08, 2026 (score 50)
Sources: lab_blog/Anthropic
🟡 Claude Code auto mode: a safer way to skip permissions, Mar 25, 2026 (score 50)
Sources: lab_blog/Anthropic
🟡 Harness design for long-running application development, Mar 24, 2026 (score 50)
Sources: lab_blog/Anthropic
🟡 Eval awareness in Claude Opus 4.6’s BrowseComp performance, Mar 06, 2026 (score 50)
Sources: lab_blog/Anthropic
Omitted 22 additional model-release items from the main section; see the raw data and source-specific sections below.
Developer Tools
🟡 ACES: Who Tests the Tests? Leave-One-Out AUC Consistency for Code Generation (score 65)
Sources: huggingface
Selecting LLM-generated code candidates using LLM-generated tests is challenging because the tests themselves may be incorrect. Existing methods either treat all tests equally or rely on ad-hoc heuristics to filter unreliable tests. Yet determining test correctness requires knowing which codes are c
🟡 Vanast: Virtual Try-On with Human Image Animation via Synthetic Triplet Supervision (score 55)
Sources: huggingface
We present Vanast, a unified framework that generates garment-transferred human animation videos directly from a single human image, garment images, and a pose guidance video. Conventional two-stage pipelines treat image-based virtual try-on and pose-driven animation as separate processes, which oft
🟢 Incremental
Model Releases
🟢 ThinkTwice: Jointly Optimizing Large Language Models for Reasoning and Self-Refinement (score 20)
Sources: huggingface
We introduce ThinkTwice, a simple two-phase framework that jointly optimizes LLMs to solve reasoning problems and refine the answers, based on Group Relative Policy Optimization (GRPO). In each pair of training steps, ThinkTwice first optimizes the model on solving reasoning problems, then optimizes
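For readers unfamiliar with GRPO, the sketch below shows the group-relative advantage step that gives the method its name: each sampled response is scored against the mean and standard deviation of its own group's rewards. This is a generic illustration, not the ThinkTwice implementation; the function name, the example rewards, the `eps` term, and the use of the population standard deviation are all assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    Hypothetical helper: a response that beats the group average gets a
    positive advantage, one that falls short gets a negative advantage.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one prompt, rewarded 1.0 if correct and 0.0 otherwise.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Because the baseline comes from the group itself, no separate value network is needed to estimate it.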
🟢 How Well Do Agentic Skills Work in the Wild: Benchmarking LLM Skill Usage in Realistic Settings (score 5)
Sources: huggingface
Agent skills, which are reusable, domain-specific knowledge artifacts, have become a popular mechanism for extending LLM-based agents, yet formally benchmarking skill usage performance remains scarce. Existing skill benchmarking efforts focus on overly idealized conditions, where LLMs are directly p
Developer Tools
🟢 Beyond Accuracy: Unveiling Inefficiency Patterns in Tool-Integrated Reasoning (score 20)
Sources: huggingface
In real-world Tool-Integrated Reasoning (TIR) scenarios, where LLMs interleave reasoning with external tool calls, a major source of inefficiency is that the tool calls create pauses between LLM requests and cause KV-Cache eviction, forcing recomputation. Also, the long, unfiltered response returned
Infrastructure & Compute
🟢 MegaTrain: Full Precision Training of 100B+ Parameter Large Language Models on a Single GPU (score 35)
Sources: huggingface
We present MegaTrain, a memory-centric system that efficiently trains 100B+ parameter large language models at full precision on a single GPU. Unlike traditional GPU-centric systems, MegaTrain stores parameters and optimizer states in host memory (CPU memory) and treats GPUs as transient compute eng
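A rough calculation shows why training state at this scale has to live outside GPU memory. The sketch below is back-of-the-envelope arithmetic only, not taken from the paper: it assumes 4-byte FP32 values for weights and gradients and an Adam-style optimizer that keeps two extra values per parameter. The function name and defaults are hypothetical.

```python
BYTES_FP32 = 4  # assumption: "full precision" means 32-bit floats
GIB = 1024 ** 3


def training_state_gib(num_params, optimizer_slots=2):
    """Estimate GiB for weights, gradients, and optimizer state.

    optimizer_slots=2 assumes an Adam-style optimizer (first and second
    moments); activations and temporary buffers are not counted.
    """
    per_copy = num_params * BYTES_FP32 / GIB
    weights = per_copy
    grads = per_copy
    optimizer = per_copy * optimizer_slots
    return {
        "weights": weights,
        "grads": grads,
        "optimizer": optimizer,
        "total": weights + grads + optimizer,
    }


estimate = training_state_gib(100_000_000_000)  # 100B parameters
for name, gib in estimate.items():
    print(f"{name:>9}: {gib:8.1f} GiB")
```

Under these assumptions the weights alone come to roughly 373 GiB and the full training state to roughly 1,490 GiB, well beyond the memory of a single GPU.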
New Papers
| Title | Category | Score | Link |
|---|---|---|---|
| Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding | model_release | 243 | Open |
| Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents | developer_tool | 124 | Open |
| Learning to Retrieve from Agent Trajectories | developer_tool | 78 | Open |
| ACES: Who Tests the Tests? Leave-One-Out AUC Consistency for Code Generation | developer_tool | 57 | Open |
| Vanast: Virtual Try-On with Human Image Animation via Synthetic Triplet Supervision | developer_tool | 51 | Open |
| EchoChain: A Full-Duplex Benchmark for State-Update Reasoning Under Interruptions | cs.AI | 0 | Open |
| SkillSieve: A Hierarchical Triage Framework for Detecting Malicious AI Agent Skills | cs.AI | 0 | Open |
| On Emotion-Sensitive Decision Making of Small Language Model Agents | cs.AI | 0 | Open |
| AI-Driven Research for Databases | cs.AI | 0 | Open |
| LLM-based Schema-Guided Extraction and Validation of Missing-Person Intelligence from Heterogeneous Data Sources | cs.AI | 0 | Open |
| Orthogonal Quadratic Complements for Vision Transformer Feed-Forward Networks | cs.AI | 0 | Open |
| Latent Structure of Affective Representations in Large Language Models | cs.AI | 0 | Open |
| Scientific Knowledge-driven Decoding Constraints Improving the Reliability of LLMs | cs.AI | 0 | Open |
| TwinLoop: Simulation-in-the-Loop Digital Twins for Online Multi-Agent Reinforcement Learning | cs.AI | 0 | Open |
| The Detection-Extraction Gap: Models Know the Answer Before They Can Say It | cs.AI | 0 | Open |
Lab Blog Posts
- Anthropic: Featured: An update on recent Claude Code quality reports. We traced recent reports of Claude Code quality issues to three separate changes. Here's what happened and what we're changing.
- Anthropic: Scaling Managed Agents: Decoupling the brain from the hands (Apr 08, 2026)
- Anthropic: Claude Code auto mode: a safer way to skip permissions (Mar 25, 2026)
- Anthropic: Harness design for long-running application development (Mar 24, 2026)
- Anthropic: Eval awareness in Claude Opus 4.6’s BrowseComp performance (Mar 06, 2026)
- Anthropic: Quantifying infrastructure noise in agentic coding evals (Feb 05, 2026)
- Anthropic: Building a C compiler with a team of parallel Claudes (Feb 05, 2026)
- Anthropic: Designing AI-resistant technical evaluations (Jan 21, 2026)
- Anthropic: Demystifying evals for AI agents (Jan 09, 2026)
- Anthropic: Effective harnesses for long-running agents (Nov 26, 2025)
- Anthropic: Introducing advanced tool use on the Claude Developer Platform (Nov 24, 2025)
- Anthropic: Code execution with MCP: Building more efficient agents (Nov 04, 2025)
- Anthropic: Beyond permission prompts: making Claude Code more secure and autonomous (Oct 20, 2025)
- Anthropic: Equipping agents for the real world with Agent Skills (Oct 16, 2025)
- Anthropic: Effective context engineering for AI agents (Sep 29, 2025)
- Anthropic: A postmortem of three recent issues (Sep 17, 2025)
- Anthropic: Writing effective tools for agents - with agents (Sep 11, 2025)
- Anthropic: Desktop Extensions: One-click MCP server installation for Claude Desktop (Jun 26, 2025)
- Anthropic: How we built our multi-agent research system (Jun 13, 2025)
- Anthropic: Claude Code: Best practices for agentic coding (Apr 18, 2025)
- Anthropic: The "think" tool: Enabling Claude to stop and think in complex tool use situations (Mar 20, 2025)
- Anthropic: Raising the bar on SWE-bench Verified with Claude 3.5 Sonnet (Jan 06, 2025)
- Anthropic: Building effective agents (Dec 19, 2024)
- Anthropic: Introducing Contextual Retrieval (Sep 19, 2024)
- OpenAI: The next phase of enterprise AI
- OpenAI: Introducing the Child Safety Blueprint