🔴 High Significance

Model Releases

🔴 Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding - score 95. Sources: huggingface

With the rapid advancement of video understanding, existing benchmarks are becoming increasingly saturated, exposing a critical discrepancy between inflated leaderboard scores and real-world model capabilities. To address this widening gap, we introduce Video-MME-v2, a comprehensive benchmark design

Developer Tools

🔴 Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents - score 85. Sources: huggingface

Large language models are increasingly deployed as autonomous agents executing multi-step workflows in real-world software environments. However, existing agent benchmarks suffer from three critical limitations: (1) trajectory-opaque grading that checks only final outputs, (2) underspecified safety

🔴 Learning to Retrieve from Agent Trajectories - score 75. Sources: huggingface

Information retrieval (IR) systems have traditionally been designed and trained for human users, with learning-to-rank methods relying heavily on large-scale human interaction logs such as clicks and dwell time. With the rapid emergence of large language model (LLM) powered search agents, however, r

🟡 Notable

Model Releases

🟡 An update on recent Claude Code quality reports - score 50. Sources: lab_blog/Anthropic

We traced recent reports of Claude Code quality issues to three separate changes. Here's what happened and what we're changing.

🟡 Scaling Managed Agents: Decoupling the brain from the hands (Apr 08, 2026) - score 50. Sources: lab_blog/Anthropic

Claude Code auto mode: a safer way to skip permissions (Mar 25, 2026); Harness design for long-running application development (Mar 24, 2026); Eval awareness in Claude Opus 4.6's BrowseComp performance (Mar 06, 2026); Quantifying infrastructure noise in agentic coding evals (Feb 05, 2026); Building a C compiler

🟡 Claude Code auto mode: a safer way to skip permissions (Mar 25, 2026) - score 50. Sources: lab_blog/Anthropic

🟡 Harness design for long-running application development (Mar 24, 2026) - score 50. Sources: lab_blog/Anthropic

🟡 Eval awareness in Claude Opus 4.6's BrowseComp performance (Mar 06, 2026) - score 50. Sources: lab_blog/Anthropic

Omitted 22 additional model release items from the main section; see the raw data and source-specific sections below.

Developer Tools

🟡 ACES: Who Tests the Tests? Leave-One-Out AUC Consistency for Code Generation - score 65. Sources: huggingface

Selecting LLM-generated code candidates using LLM-generated tests is challenging because the tests themselves may be incorrect. Existing methods either treat all tests equally or rely on ad-hoc heuristics to filter unreliable tests. Yet determining test correctness requires knowing which codes are c

🟡 Vanast: Virtual Try-On with Human Image Animation via Synthetic Triplet Supervision - score 55. Sources: huggingface

We present Vanast, a unified framework that generates garment-transferred human animation videos directly from a single human image, garment images, and a pose guidance video. Conventional two-stage pipelines treat image-based virtual try-on and pose-driven animation as separate processes, which oft

🟢 Incremental

Model Releases

🟢 ThinkTwice: Jointly Optimizing Large Language Models for Reasoning and Self-Refinement - score 20. Sources: huggingface

We introduce ThinkTwice, a simple two-phase framework that jointly optimizes LLMs to solve reasoning problems and refine the answers, based on Group Relative Policy Optimization (GRPO). In each pair of training steps, ThinkTwice first optimizes the model on solving reasoning problems, then optimizes

🟢 How Well Do Agentic Skills Work in the Wild: Benchmarking LLM Skill Usage in Realistic Settings - score 5. Sources: huggingface

Agent skills, which are reusable, domain-specific knowledge artifacts, have become a popular mechanism for extending LLM-based agents, yet formally benchmarking skill usage performance remains scarce. Existing skill benchmarking efforts focus on overly idealized conditions, where LLMs are directly p

Developer Tools

🟢 Beyond Accuracy: Unveiling Inefficiency Patterns in Tool-Integrated Reasoning - score 20. Sources: huggingface

In real-world Tool-Integrated Reasoning (TIR) scenarios, where LLMs interleave reasoning with external tool calls, a major source of inefficiency is that the tool calls create pauses between LLM requests and cause KV-Cache eviction, forcing recomputation. Also, the long, unfiltered response returned

Infrastructure & Compute

🟢 MegaTrain: Full Precision Training of 100B+ Parameter Large Language Models on a Single GPU - score 35. Sources: huggingface

We present MegaTrain, a memory-centric system that efficiently trains 100B+ parameter large language models at full precision on a single GPU. Unlike traditional GPU-centric systems, MegaTrain stores parameters and optimizer states in host memory (CPU memory) and treats GPUs as transient compute eng

📄 New Papers

Title | Category | Score | Link
Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding | model_release | 243 | Open
Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents | developer_tool | 124 | Open
Learning to Retrieve from Agent Trajectories | developer_tool | 78 | Open
ACES: Who Tests the Tests? Leave-One-Out AUC Consistency for Code Generation | developer_tool | 57 | Open
Vanast: Virtual Try-On with Human Image Animation via Synthetic Triplet Supervision | developer_tool | 51 | Open
EchoChain: A Full-Duplex Benchmark for State-Update Reasoning Under Interruptions | cs.AI | 0 | Open
SkillSieve: A Hierarchical Triage Framework for Detecting Malicious AI Agent Skills | cs.AI | 0 | Open
On Emotion-Sensitive Decision Making of Small Language Model Agents | cs.AI | 0 | Open
AI-Driven Research for Databases | cs.AI | 0 | Open
LLM-based Schema-Guided Extraction and Validation of Missing-Person Intelligence from Heterogeneous Data Sources | cs.AI | 0 | Open
Orthogonal Quadratic Complements for Vision Transformer Feed-Forward Networks | cs.AI | 0 | Open
Latent Structure of Affective Representations in Large Language Models | cs.AI | 0 | Open
Scientific Knowledge-driven Decoding Constraints Improving the Reliability of LLMs | cs.AI | 0 | Open
TwinLoop: Simulation-in-the-Loop Digital Twins for Online Multi-Agent Reinforcement Learning | cs.AI | 0 | Open
The Detection-Extraction Gap: Models Know the Answer Before They Can Say It | cs.AI | 0 | Open

๐Ÿข Lab Blog Posts