AI Watchtower

🔴 High Significance

Model Releases

🔴 AI Can Learn Scientific Taste — score 95 Sources: huggingface

Great scientists have strong judgement and foresight, closely tied to what we call scientific taste. Here, we use the term to refer to the capacity to judge and propose research ideas with high potential impact. However, most related research focuses on improving an AI scientist's executive capabil

Developer Tools

🔴 Grounding World Simulation Models in a Real-World Metropolis — score 75 Sources: huggingface

What if a world simulation model could render not an imagined environment but a city that actually exists? Prior generative world models synthesize visually plausible yet artificial environments by imagining all content. We present Seoul World Model (SWM), a city-scale world model grounded in the re

Infrastructure & Compute

🔴 Attention Residuals — score 85 Sources: huggingface

Residual connections with PreNorm are standard in modern LLMs, yet they accumulate all layer outputs with fixed unit weights. This uniform aggregation causes uncontrolled hidden-state growth with depth, progressively diluting each layer's contribution. We propose Attention Residuals (AttnRes), which
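A toy sketch of the unit-weight accumulation described above, offered only as an illustration and not as the paper's method: it does not implement AttnRes. The function names, the random stand-in layer, and the dimensions are all assumptions made for the example. Each layer output is added to the hidden state with a fixed weight of 1, and the hidden-state norm is tracked as depth increases.

```python
import math
import random


def l2_norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def rms_normalize(v):
    """Scale v so its root-mean-square value is 1 (a simplified RMSNorm, no learned gain)."""
    rms = math.sqrt(sum(x * x for x in v) / len(v))
    return [x / rms for x in v]


def toy_layer(x, rng):
    """Hypothetical stand-in for an attention or MLP block.

    It ignores the content of x and returns a random unit-RMS vector of the
    same size, which is enough to show how unit-weight sums behave.
    """
    return rms_normalize([rng.gauss(0.0, 1.0) for _ in x])


def prenorm_residual_norms(depth, dim=64, seed=0):
    """Return the hidden-state L2 norm after each of `depth` PreNorm residual steps."""
    rng = random.Random(seed)
    h = rms_normalize([rng.gauss(0.0, 1.0) for _ in range(dim)])
    norms = [l2_norm(h)]
    for _ in range(depth):
        # PreNorm: the layer sees a normalized copy of the hidden state.
        update = toy_layer(rms_normalize(h), rng)
        # Residual connection: the output is added with a fixed weight of 1.
        h = [a + b for a, b in zip(h, update)]
        norms.append(l2_norm(h))
    return norms


if __name__ == "__main__":
    norms = prenorm_residual_norms(depth=32)
    for layer in (0, 8, 16, 32):
        print(f"layer {layer:2d}: hidden-state norm = {norms[layer]:.2f}")
```

In this toy setting every update has the same size while the accumulated state keeps growing, so each individual layer makes up a shrinking share of the total, which is the dilution effect the abstract refers to.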

🟡 Notable

Model Releases

🟡 Introducing GPT-5.4 mini and nano — score 50 Sources: lab_blog/OpenAI

GPT-5.4 mini and nano are smaller, faster versions of GPT-5.4 optimized for coding, tool use, multimodal reasoning, and high-volume API and sub-agent workloads.

🟡 OpenAI Japan announces Japan Teen Safety Blueprint to put teen safety first — score 50 Sources: lab_blog/OpenAI

OpenAI Japan announces the Japan Teen Safety Blueprint, introducing stronger age protections, parental controls, and well-being safeguards for teens using generative AI.

🟡 Equipping workers with insights about compensation — score 50 Sources: lab_blog/OpenAI

New research shows Americans send nearly 3 million daily messages to ChatGPT asking about compensation and earnings, helping close the wage information gap.

🟡 Measuring progress toward AGI: A cognitive framework — score 50 Sources: lab_blog/DeepMind

We’re introducing a framework to measure progress toward AGI, and launching a Kaggle hackathon to build the relevant evaluations.

🟡 EnterpriseOps-Gym: Environments and Evaluations for Stateful Agentic Planning and Tool Use in Enterprise Settings — score 45 Sources: huggingface

Large language models are shifting from passive information providers to active agents intended for complex workflows. However, their deployment as reliable AI workers in enterprise is stalled by benchmarks that fail to capture the intricacies of professional environments, specifically, the need for

Developer Tools

🟡 OpenSeeker: Democratizing Frontier Search Agents by Fully Open-Sourcing Training Data — score 65 Sources: huggingface

Deep search capabilities have become an indispensable competency for frontier Large Language Model (LLM) agents, yet the development of high-performance search agents remains dominated by industrial giants due to a lack of transparent, high-quality training data. This persistent data scarcity has fu

🟡 HSImul3R: Physics-in-the-Loop Reconstruction of Simulation-Ready Human-Scene Interactions — score 55 Sources: huggingface

We present HSImul3R, a unified framework for simulation-ready 3D reconstruction of human-scene interactions (HSI) from casual captures, including sparse-view images and monocular videos. Existing methods suffer from a perception-simulation gap: visually plausible reconstructions often violate physic

🟢 Incremental

Model Releases

🟢 Mixture-of-Depths Attention — score 35 Sources: huggingface

Scaling depth is a key driver for large language models (LLMs). Yet, as LLMs become deeper, they often suffer from signal degradation: informative features formed in shallow layers are gradually diluted by repeated residual updates, making them harder to recover in deeper layers. We introduce mixtur

🟢 Effective Distillation to Hybrid xLSTM Architectures — score 25 Sources: huggingface

There have been numerous attempts to distill quadratic attention-based large language models (LLMs) into sub-quadratic linearized architectures. However, despite extensive research, such distilled models often fail to match the performance of their teacher LLMs on various downstream tasks. We set ou

🟢 Safe and Scalable Web Agent Learning via Recreated Websites — score 5 Sources: huggingface

Training autonomous web agents is fundamentally limited by the environments they learn from: real-world websites are unsafe to explore, hard to reset, and rarely provide verifiable feedback. We propose VeriEnv, a framework that treats language models as environment creators, automatically cloning re

Developer Tools

🟢 Anatomy of a Lie: A Multi-Stage Diagnostic Framework for Tracing Hallucinations in Vision-Language Models — score 15 Sources: huggingface

Vision-Language Models (VLMs) frequently "hallucinate" - generate plausible yet factually incorrect statements - posing a critical barrier to their trustworthy deployment. In this work, we propose a new paradigm for diagnosing hallucinations, recasting them from static output errors into dynamic pat

📄 New Papers

Title	Category	Score	Link
AI Can Learn Scientific Taste	model_release	437	Open
Attention Residuals	infrastructure	189	Open
Grounding World Simulation Models in a Real-World Metropolis	developer_tool	157	Open
OpenSeeker: Democratizing Frontier Search Agents by Fully Open-Sourcing Training Data	developer_tool	155	Open
HSImul3R: Physics-in-the-Loop Reconstruction of Simulation-Ready Human-Scene Interactions	developer_tool	154	Open
Interpretable Context Methodology: Folder Structure as Agentic Architecture	cs.AI	0	Open
EMA Is Not All You Need: Mapping the Boundary Between Structure and Content in Recurrent Context	cs.AI	0	Open
Residual Stream Duality in Modern Transformer Architectures	cs.AI	0	Open
Collaborative Temporal Feature Generation via Critic-Free Reinforcement Learning for Cross-User Sensor-Based Activity Recognition	cs.AI	0	Open
Enhancing Linguistic Generalization of VLA: Fine-Tuning OpenVLA via Synthetic Instruction Augmentation	cs.AI	0	Open
POaaS: Minimal-Edit Prompt Optimization as a Service to Lift Accuracy and Cut Hallucinations on On-Device sLLMs	cs.AI	0	Open
A Context Alignment Pre-processor for Enhancing the Coherence of Human-LLM Dialog	cs.AI	0	Open
ARISE: Agent Reasoning with Intrinsic Skill Evolution in Hierarchical Reinforcement Learning	cs.AI	0	Open
Large Reward Models: Generalizable Online Robot Reward Generation with Vision-Language Models	cs.AI	0	Open
Re-Mask and Redirect: Exploiting Denoising Irreversibility in Diffusion Language Models	cs.AI	0	Open

🏢 Lab Blog Posts

OpenAI: Introducing GPT-5.4 mini and nano
OpenAI: OpenAI Japan announces Japan Teen Safety Blueprint to put teen safety first
OpenAI: Equipping workers with insights about compensation
DeepMind: Measuring progress toward AGI: A cognitive framework

AI Watchtower Briefing — 2026-03-17
