AI Watchtower

🔴 High Significance

Model Releases

🔴 HopChain: Multi-Hop Data Synthesis for Generalizable Vision-Language Reasoning — score 85 Sources: huggingface

VLMs show strong multimodal capabilities, but they still struggle with fine-grained vision-language reasoning. We find that long CoT reasoning exposes diverse failure modes, including perception, reasoning, knowledge, and hallucination errors, which can compound across intermediate steps. However, m

Developer Tools

🔴 Astrolabe: Steering Forward-Process Reinforcement Learning for Distilled Autoregressive Video Models — score 95 Sources: huggingface

Distilled autoregressive (AR) video models enable efficient streaming generation but frequently misalign with human visual preferences. Existing reinforcement learning (RL) frameworks are not naturally suited to these architectures, typically requiring either expensive re-distillation or solver-coup

🔴 TerraScope: Pixel-Grounded Visual Reasoning for Earth Observation — score 75 Sources: huggingface

Vision-language models (VLMs) have shown promise in earth observation (EO), yet they struggle with tasks that require grounding complex spatial reasoning in precise pixel-level visual representations. To address this problem, we introduce TerraScope, a unified VLM that delivers pixel-grounded geospa

🟡 Notable

Model Releases

🟡 ProactiveBench: Benchmarking Proactiveness in Multimodal Large Language Models — score 55 Sources: huggingface

Effective collaboration begins with knowing when to ask for help. For example, when trying to identify an occluded object, a human would ask someone to remove the obstruction. Can MLLMs exhibit a similar "proactive" behavior by requesting simple user interventions? To investigate this, we introduce

Developer Tools

🟡 Hyperagents — score 65 Sources: huggingface

Self-improving AI systems aim to reduce reliance on human engineering by learning to improve their own learning and problem-solving processes. Existing approaches to self-improvement rely on fixed, handcrafted meta-level mechanisms, fundamentally limiting how fast such systems can improve. The Darwi

🟡 Creating with Sora Safely — score 50 Sources: lab_blog/OpenAI

To address the novel safety challenges posed by a state-of-the-art video model as well as a new social creation platform, we’ve built Sora 2 and the Sora app with safety at the foundation. Our approach is anchored in concrete protections.

🟡 The Y-Combinator for LLMs: Solving Long-Context Rot with λ-Calculus — score 45 Sources: huggingface

LLMs are increasingly used as general-purpose reasoners, but long inputs remain bottlenecked by a fixed context window. Recursive Language Models (RLMs) address this by externalising the prompt and recursively solving subproblems. Yet existing RLMs depend on an open-ended read-eval-print loop (REPL)

🟢 Incremental

Model Releases

🟢 A Subgoal-driven Framework for Improving Long-Horizon LLM Agents — score 5 Sources: huggingface

Large language model (LLM)-based agents have emerged as powerful autonomous controllers for digital environments, including mobile interfaces, operating systems, and web browsers. Web navigation, for example, requires handling dynamic content and long sequences of actions, making it particularly cha

Developer Tools

🟢 FlowScene: Style-Consistent Indoor Scene Generation with Multimodal Graph Rectified Flow — score 35 Sources: huggingface

Scene generation has extensive industrial applications, demanding both high realism and precise control over geometry and appearance. Language-driven retrieval methods compose plausible scenes from a large object database, but overlook object-level control and often fail to enforce scene-level style

🟢 LumosX: Relate Any Identities with Their Attributes for Personalized Video Generation — score 20 Sources: huggingface

Recent advances in diffusion models have significantly improved text-to-video generation, enabling personalized content creation with fine-grained control over both foreground and background elements. However, precise face-attribute alignment across subjects remains challenging, as existing methods

🟢 Reasoning as Compression: Unifying Budget Forcing via the Conditional Information Bottleneck — score 20 Sources: huggingface

Chain-of-Thought (CoT) prompting improves LLM accuracy on complex tasks but often increases token usage and inference cost. Existing "Budget Forcing" methods, which reduce cost via fine-tuning with heuristic length penalties, suppress both essential reasoning and redundant filler. We recast efficient rea

📄 New Papers

Title	Category	Score	Link
Astrolabe: Steering Forward-Process Reinforcement Learning for Distilled Autoregressive Video Models	developer_tool	116	Open
HopChain: Multi-Hop Data Synthesis for Generalizable Vision-Language Reasoning	model_release	114	Open
TerraScope: Pixel-Grounded Visual Reasoning for Earth Observation	developer_tool	56	Open
Hyperagents	developer_tool	55	Open
ProactiveBench: Benchmarking Proactiveness in Multimodal Large Language Models	model_release	45	Open
When Documents Disagree: Measuring Institutional Variation in Transplant Guidance with Retrieval-Augmented Language Models	cs.AI	0	Open
DSPA: Dynamic SAE Steering for Data-Efficient Preference Alignment	cs.AI	0	Open
Beyond Correlation: Refutation-Validated Aspect-Based Sentiment Analysis for Explainable Energy Market Returns	cs.AI	0	Open
Unified-MAS: Universally Generating Domain-Specific Nodes for Empowering Automatic Multi-Agent Systems	cs.AI	0	Open
Alignment as Institutional Design: From Behavioral Correction to Transaction Structure in Intelligent Systems	cs.AI	0	Open
Effective Strategies for Asynchronous Software Engineering Agents	cs.AI	0	Open
RuntimeSlicer: Towards Generalizable Unified Runtime State Representation for Failure Management	cs.AI	0	Open
A Framework for Closed-Loop Robotic Assembly, Alignment and Self-Recovery of Precision Optical Systems	cs.AI	0	Open
Implicit Humanization in Everyday LLM Moral Judgments	cs.AI	0	Open
Quotient Geometry, Effective Curvature, and Implicit Bias in Simple Shallow Neural Networks	cs.AI	0	Open

🏢 Lab Blog Posts

OpenAI: Creating with Sora Safely

AI Watchtower Briefing — 2026-03-23
