AW · AI Watchtower

🔴 High Significance

Developer Tools

🔴 Does Your Reasoning Model Implicitly Know When to Stop Thinking? — score 95 Sources: huggingface

Recent advancements in large reasoning models (LRMs) have greatly improved their capabilities on complex reasoning tasks through Long Chains of Thought (CoTs). However, this approach often results in substantial redundancy, impairing computational efficiency and causing significant delays in real-ti

🔴 VESPO: Variational Sequence-Level Soft Policy Optimization for Stable Off-Policy LLM Training — score 85 Sources: huggingface

Training stability remains a central challenge in reinforcement learning (RL) for large language models (LLMs). Policy staleness, asynchronous training, and mismatches between training and inference engines all cause the behavior policy to diverge from the current policy, risking training collapse.

🔴 Generated Reality: Human-centric World Simulation using Interactive Video Generation with Hand and Camera Control — score 75 Sources: huggingface

Extended reality (XR) demands generative models that respond to users' tracked real-world motion, yet current video world models accept only coarse control signals such as text or keyboard input, limiting their utility for embodied interaction. We introduce a human-centric video world model that is

🟡 Notable

Model Releases

🟡 Decoding as Optimisation on the Probability Simplex: From Top-K to Top-P (Nucleus) to Best-of-K Samplers — score 55 Sources: huggingface

Decoding sits between a language model and everything we do with it, yet it is still treated as a heuristic knob-tuning exercise. We argue decoding should be understood as a principled optimisation layer: at each token, we solve a regularised problem over the probability simplex that trades off mode
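To make the samplers named in the title concrete, here is a minimal sketch of Top-K and Top-P (nucleus) truncation as they are commonly implemented: each maps a next-token distribution to another point on the probability simplex by zeroing low-probability tokens and renormalising. It illustrates the standard heuristics only, not the regularised optimisation formulation the paper proposes; the function names and the toy distribution are assumptions made for this example.

```python
import math


def softmax(logits):
    """Turn raw logits into a probability vector (a point on the simplex)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def renormalise(probs, keep):
    """Zero every index not in `keep`, then rescale so the result sums to 1."""
    mass = sum(probs[i] for i in keep)
    return [p / mass if i in keep else 0.0 for i, p in enumerate(probs)]


def top_k(probs, k):
    """Keep the k most probable tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return renormalise(probs, set(order[:k]))


def top_p(probs, p):
    """Keep the smallest high-probability prefix whose mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in order:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return renormalise(probs, keep)


# Hypothetical toy distribution over a four-token vocabulary.
toy = [0.5, 0.3, 0.15, 0.05]
print(top_k(toy, 2))
print(top_p(toy, 0.9))
```

Both functions return a full-length vector, so the result can be passed straight to any categorical sampler. Top-K fixes the size of the support, while Top-P fixes the retained probability mass and lets the support size vary with the shape of the distribution.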

Developer Tools

🟡 EgoPush: Learning End-to-End Egocentric Multi-Object Rearrangement for Mobile Robots — score 65 Sources: huggingface

Humans can rearrange objects in cluttered environments using egocentric perception, navigating occlusions without global coordinates. Inspired by this capability, we study long-horizon multi-object non-prehensile rearrangement for mobile robots using a single egocentric camera. We introduce EgoPush,

🟡 OpenAI announces Frontier Alliance Partners — score 50 Sources: lab_blog/OpenAI

OpenAI announces Frontier Alliance Partners to help enterprises move from AI pilots to production with secure, scalable agent deployments.

Infrastructure & Compute

🟡 Why we no longer evaluate SWE-bench Verified — score 50 Sources: lab_blog/OpenAI

SWE-bench Verified is increasingly contaminated and mismeasures frontier coding progress. Our analysis shows flawed tests and training leakage. We recommend SWE-bench Pro.

🟡 Spanning the Visual Analogy Space with a Weight Basis of LoRAs — score 45 Sources: huggingface

Visual analogy learning enables image manipulation through demonstration rather than textual description, allowing users to specify complex transformations difficult to articulate in words. Given a triplet {a, a', b}, the goal is to generate b' such that a : a' :: b : b'. Recent methods adapt text-t

🟢 Incremental

Developer Tools

🟢 VidEoMT: Your ViT is Secretly Also a Video Segmentation Model — score 20 Sources: huggingface

Existing online video segmentation models typically combine a per-frame segmenter with complex specialized tracking modules. While effective, these modules introduce significant architectural complexity and computational overhead. Recent studies suggest that plain Vision Transformer (ViT) encoders,

🟢 SARAH: Spatially Aware Real-time Agentic Humans — score 20 Sources: huggingface

As embodied agents become central to VR, telepresence, and digital human applications, their motion must go beyond speech-aligned gestures: agents should turn toward users, respond to their movement, and maintain natural gaze. Current methods lack this spatial awareness. We close this gap with the f

🟢 Sink-Aware Pruning for Diffusion Language Models — score 5 Sources: huggingface

Diffusion Language Models (DLMs) incur high inference cost due to iterative denoising, motivating efficient pruning. Existing pruning heuristics, largely inherited from autoregressive (AR) LLMs, typically preserve attention sink tokens because AR sinks serve as stable global anchors. We show that thi

Infrastructure & Compute

🟢 DeepVision-103K: A Visually Diverse, Broad-Coverage, and Verifiable Mathematical Dataset for Multimodal Reasoning — score 35 Sources: huggingface

Reinforcement Learning with Verifiable Rewards (RLVR) has been shown to be effective in enhancing the visual reflection and reasoning capabilities of Large Multimodal Models (LMMs). However, existing datasets are predominantly derived from either small-scale manual construction or recombination of prior r

📄 New Papers

Title	Category	Score	Link
Does Your Reasoning Model Implicitly Know When to Stop Thinking?	developer_tool	275	Open
VESPO: Variational Sequence-Level Soft Policy Optimization for Stable Off-Policy LLM Training	developer_tool	229	Open
Generated Reality: Human-centric World Simulation using Interactive Video Generation with Hand and Camera Control	developer_tool	35	Open
EgoPush: Learning End-to-End Egocentric Multi-Object Rearrangement for Mobile Robots	developer_tool	24	Open
Decoding as Optimisation on the Probability Simplex: From Top-K to Top-P (Nucleus) to Best-of-K Samplers	model_release	20	Open
Hiding in Plain Text: Detecting Concealed Jailbreaks via Activation Disentanglement	cs.AI	0	Open
Hilbert-Augmented Reinforcement Learning for Scalable Multi-Robot Coverage and Exploration	cs.AI	0	Open
Model Merging in the Essential Subspace	cs.AI	0	Open
Redefining the Down-Sampling Scheme of U-Net for Precision Biomedical Image Segmentation	cs.AI	0	Open
IR$^3$: Contrastive Inverse Reinforcement Learning for Interpretable Detection and Mitigation of Reward Hacking	cs.AI	0	Open
SIGMAS: Second-Order Interaction-based Grouping for Overlapping Multi-Agent Swarms	cs.AI	0	Open
FinSight-Net: A Physics-Aware Decoupled Network with Frequency-Domain Compensation for Underwater Fish Detection in Smart Aquaculture	cs.AI	0	Open
OptiRepair: Closed-Loop Diagnosis and Repair of Supply Chain Optimization Models with LLM Agents	cs.AI	0	Open
When AI Teammates Meet Code Review: Collaboration Signals Shaping the Integration of Agent-Authored Pull Requests	cs.AI	0	Open
Red-Teaming Claude Opus and ChatGPT-based Security Advisors for Trusted Execution Environments	cs.AI	0	Open

🏢 Lab Blog Posts

OpenAI: Why we no longer evaluate SWE-bench Verified
OpenAI: OpenAI announces Frontier Alliance Partners

AI Watchtower Briefing — 2026-02-23
