AI Watchtower

🔴 High Significance

Model Releases

🔴 Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document Collections — score 85 Sources: huggingface

Multimodal agents offer a promising path to automating complex document-intensive workflows. Yet, a critical question remains: do these agents demonstrate genuine strategic reasoning, or merely stochastic trial-and-error search? To address this, we introduce MADQA, a benchmark of 2,250 human-authore

Developer Tools

🔴 Spatial-TTT: Streaming Visual-based Spatial Intelligence with Test-Time Training — score 95 Sources: huggingface

Humans perceive and understand real-world spaces through a stream of visual observations. Therefore, the ability to streamingly maintain and update spatial evidence from potentially unbounded video streams is essential for spatial intelligence. The core challenge is not simply longer context windows

🔴 IndexCache: Accelerating Sparse Attention via Cross-Layer Index Reuse — score 75 Sources: huggingface

Long-context agentic workflows have emerged as a defining use case for large language models, making attention efficiency critical for both inference speed and serving cost. Sparse attention addresses this challenge effectively, and DeepSeek Sparse Attention (DSA) is a representative production-grad

🟡 Notable

Model Releases

🟡 Video-Based Reward Modeling for Computer-Use Agents — score 65 Sources: huggingface

Computer-using agents (CUAs) are becoming increasingly capable; however, it remains difficult to scale evaluation of whether a trajectory truly fulfills a user instruction. In this work, we study reward modeling from execution video: a sequence of keyframes from an agent trajectory that is independe

Developer Tools

🟡 ShotVerse: Advancing Cinematic Camera Control for Text-Driven Multi-Shot Video Creation — score 55 Sources: huggingface

Text-driven video generation has democratized film creation, but camera control in cinematic multi-shot scenarios remains a significant obstacle. Implicit textual prompts lack precision, while explicit trajectory conditioning imposes prohibitive manual overhead and often triggers execution failures in

🟡 XSkill: Continual Learning from Experience and Skills in Multimodal Agents — score 45 Sources: huggingface

Multimodal agents can now tackle complex reasoning tasks with diverse tools, yet they still suffer from inefficient tool use and inflexible orchestration in open-ended settings. A central challenge is enabling such agents to continually improve without parameter updates by learning from past traject

🟢 Incremental

Model Releases

🟢 DVD: Deterministic Video Depth Estimation with Generative Priors — score 25 Sources: huggingface

Existing video depth estimation faces a fundamental trade-off: generative models suffer from stochastic geometric hallucinations and scale drift, while discriminative models demand massive labeled datasets to resolve semantic ambiguities. To break this impasse, we present DVD, the first framework to

🟢 Trust Your Critic: Robust Reward Modeling and Reinforcement Learning for Faithful Image Editing and Generation — score 10 Sources: huggingface

Reinforcement learning (RL) has emerged as a promising paradigm for enhancing image editing and text-to-image (T2I) generation. However, current reward models, which act as critics during RL, often suffer from hallucinations and assign noisy scores, inherently misguiding the optimization process. In

Developer Tools

🟢 DreamVideo-Omni: Omni-Motion Controlled Multi-Subject Video Customization with Latent Identity Reinforcement Learning — score 35 Sources: huggingface

While large-scale diffusion models have revolutionized video synthesis, achieving precise control over both multi-subject identity and multi-granularity motion remains a significant challenge. Recent attempts to bridge this gap often suffer from limited motion granularity, control ambiguity, and ide

🟢 WeEdit: A Dataset, Benchmark and Glyph-Guided Framework for Text-centric Image Editing — score 10 Sources: huggingface

Instruction-based image editing aims to modify specific content within existing images according to user-provided instructions while preserving non-target regions. Beyond traditional object- and style-centric manipulation, text-centric image editing focuses on modifying, translating, or rearranging

📄 New Papers

Title	Category	Score
Spatial-TTT: Streaming Visual-based Spatial Intelligence with Test-Time Training	developer_tool	95
Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document Collections	model_release	69
IndexCache: Accelerating Sparse Attention via Cross-Layer Index Reuse	developer_tool	57
Video-Based Reward Modeling for Computer-Use Agents	model_release	46
ShotVerse: Advancing Cinematic Camera Control for Text-Driven Multi-Shot Video Creation	developer_tool	36
TERMINATOR: Learning Optimal Exit Points for Early Stopping in Chain-of-Thought Reasoning	cs.AI	0
Spatio-Semantic Expert Routing Architecture with Mixture-of-Experts for Referring Image Segmentation	cs.AI	0
Embedded Quantum Machine Learning in Embedded Systems: Feasibility, Hybrid Architectures, and Quantum Co-Processors	cs.AI	0
CALF: Communication-Aware Learning Framework for Distributed Reinforcement Learning	cs.AI	0
Compile to Compress: Boosting Formal Theorem Provers by Compiler Outputs	cs.AI	0
Concentrated siting of AI data centers drives regional power-system stress under rising global compute demand	cs.AI	0
Reinforcement Learning for Diffusion LLMs with Entropy-Guided Step Selection and Stepwise Advantages	cs.AI	0
Sell Me This Stock: Unsafe Recommendation Drift in LLM Agents	cs.AI	0
Agent Privilege Separation in OpenClaw: A Structural Defense Against Prompt Injection	cs.AI	0
Self-Flow-Matching assisted Full Waveform Inversion	cs.AI	0

AI Watchtower Briefing — 2026-03-13
