AI Watchtower

🔴 High Significance

Model Releases

🔴 Penguin-VL: Exploring the Efficiency Limits of VLM with LLM-based Vision Encoders — score 95 Sources: huggingface

Vision Language Model (VLM) development has largely relied on scaling model size, which hinders deployment on compute-constrained mobile and edge devices such as smartphones and robots. In this work, we explore the performance limits of compact (e.g., 2B and 8B) VLMs. We challenge the prevailing pra

Developer Tools

🔴 BandPO: Bridging Trust Regions and Ratio Clipping via Probability-Aware Bounds for LLM Reinforcement Learning — score 85 Sources: huggingface

Proximal constraints are fundamental to the stability of Large Language Model reinforcement learning. While the canonical clipping mechanism in PPO serves as an efficient surrogate for trust regions, we identify a critical bottleneck: fixed bounds strictly constrain the upward update margin of l

🔴 Planning in 8 Tokens: A Compact Discrete Tokenizer for Latent World Model — score 75 Sources: huggingface

World models provide a powerful framework for simulating environment dynamics conditioned on actions or instructions, enabling downstream tasks such as action planning or policy learning. Recent approaches leverage world models as learned simulators, but their application to decision-time planning rem

🟡 Notable

Model Releases

🟡 Progressive Residual Warmup for Language Model Pretraining — score 55 Sources: huggingface

Transformer architectures serve as the backbone for most modern Large Language Models; therefore, their pretraining stability and convergence speed are of central concern. Motivated by the logical dependency of sequentially stacked layers, we propose Progressive Residual Warmup (ProRes) for language

🟡 Reasoning Models Struggle to Control their Chains of Thought — score 45 Sources: huggingface

Chain-of-thought (CoT) monitoring is a promising tool for detecting misbehaviors and understanding the motivations of modern reasoning models. However, if models can control what they verbalize in their CoT, it could undermine CoT monitorability. To measure this undesirable capability -- CoT control

Developer Tools

🟡 WildActor: Unconstrained Identity-Preserving Video Generation — score 65 Sources: huggingface

Production-ready human video generation requires digital actors to maintain strictly consistent full-body identities across dynamic shots, viewpoints and motions, a setting that remains challenging for existing methods. Prior methods often suffer from face-centric behavior that neglects body-level c

🟡 OpenAI to acquire Promptfoo — score 50 Sources: lab_blog/OpenAI

OpenAI is acquiring Promptfoo, an AI security platform that helps enterprises identify and remediate vulnerabilities in AI systems during development.

Other Signals

🟡 From games to biology and beyond: 10 years of AlphaGo’s impact — score 50 Sources: lab_blog/DeepMind

Ten years since AlphaGo, we explore how it is catalyzing scientific discovery and paving a path to AGI.

🟢 Incremental

Model Releases

🟢 Physical Simulator In-the-Loop Video Generation — score 15 Sources: huggingface

Recent advances in diffusion-based video generation have achieved remarkable visual realism but still struggle to obey basic physical laws such as gravity, inertia, and collision. Generated objects often move inconsistently across frames, exhibit implausible dynamics, or violate physical constraints

🟢 HiMAP-Travel: Hierarchical Multi-Agent Planning for Long-Horizon Constrained Travel — score 5 Sources: huggingface

Sequential LLM agents fail on long-horizon planning with hard constraints like budgets and diversity requirements. As planning progresses and context grows, these agents drift from global constraints. We propose HiMAP-Travel, a hierarchical multi-agent framework that splits planning into strategic c

Developer Tools

🟢 RoboMME: Benchmarking and Understanding Memory for Robotic Generalist Policies — score 35 Sources: huggingface

Memory is critical for long-horizon and history-dependent robotic manipulation. Such tasks often involve counting repeated actions or manipulating objects that become temporarily occluded. Recent vision-language-action (VLA) models have begun to incorporate memory mechanisms; however, their evaluati

🟢 Dynamic Chunking Diffusion Transformer — score 25 Sources: huggingface

Diffusion Transformers process images as fixed-length sequences of tokens produced by a static patchify operation. While effective, this design spends uniform compute on low- and high-information regions alike, ignoring that images contain regions of varying detail and that the denoising process pro

📄 New Papers

Title	Category	Score	Link
Penguin-VL: Exploring the Efficiency Limits of VLM with LLM-based Vision Encoders	model_release	124	Open
BandPO: Bridging Trust Regions and Ratio Clipping via Probability-Aware Bounds for LLM Reinforcement Learning	developer_tool	63	Open
Planning in 8 Tokens: A Compact Discrete Tokenizer for Latent World Model	developer_tool	44	Open
WildActor: Unconstrained Identity-Preserving Video Generation	developer_tool	43	Open
Progressive Residual Warmup for Language Model Pretraining	model_release	41	Open
SynPlanResearch-R1: Encouraging Tool Exploration for Deep Research with Synthetic Plans	cs.AI	0	Open
Slumbering to Precision: Enhancing Artificial Neural Network Calibration Through Sleep-like Processes	cs.AI	0	Open
Hospitality-VQA: Decision-Oriented Informativeness Evaluation for Vision-Language Models	cs.AI	0	Open
Learning When to Trust in Contextual Bandits	cs.AI	0	Open
CCR-Bench: A Comprehensive Benchmark for Evaluating LLMs on Complex Constraints, Control Flows, and Real-World Cases	cs.AI	0	Open
Joint Return and Risk Modeling with Deep Neural Networks for Portfolio Construction	cs.AI	0	Open
Reject, Resample, Repeat: Understanding Parallel Reasoning in Language Model Inference	cs.AI	0	Open
VLM-SubtleBench: How Far Are VLMs from Human-Level Subtle Comparative Reasoning?	cs.AI	0	Open
Visualizing Coalition Formation: From Hedonic Games to Image Segmentation	cs.AI	0	Open
A Reliability Evaluation of Hybrid Deterministic-LLM Based Approaches for Academic Course Registration PDF Information Extraction	cs.AI	0	Open

🏢 Lab Blog Posts

OpenAI: OpenAI to acquire Promptfoo
DeepMind: From games to biology and beyond: 10 years of AlphaGo’s impact

AI Watchtower Briefing — 2026-03-09
