AI Watchtower

🔴 High Significance

Model Releases

🔴 ClawKeeper: Comprehensive Safety Protection for OpenClaw Agents Through Skills, Plugins, and Watchers — score 95 Sources: huggingface

OpenClaw has rapidly established itself as a leading open-source autonomous agent runtime, offering powerful capabilities including tool integration, local file access, and shell command execution. However, these broad operational privileges introduce critical security vulnerabilities, transforming…

🔴 Terminal Agents Suffice for Enterprise Automation — score 85 Sources: huggingface

There has been growing interest in building agents that can interact with digital platforms to execute meaningful enterprise tasks autonomously. Among the approaches explored are tool-augmented agents built on abstractions such as Model Context Protocol (MCP) and web agents that operate through grap…

Developer Tools

🔴 MiroEval: Benchmarking Multimodal Deep Research Agents in Process and Outcome — score 75 Sources: huggingface

Recent progress in deep research systems has been impressive, but evaluation still lags behind real user needs. Existing benchmarks predominantly assess final reports using fixed rubrics, failing to evaluate the underlying research process. Most also offer limited multimodal coverage, rely on synthe…

🟡 Notable

Model Releases

🟡 Embarrassingly Simple Self-Distillation Improves Code Generation — score 65 Sources: huggingface

Can a large language model (LLM) improve at code generation using only its own raw outputs, without a verifier, a teacher model, or reinforcement learning? We answer in the affirmative with simple self-distillation (SSD): sample solutions from the model with certain temperature and truncation config…

🟡 Codex now offers more flexible pricing for teams — score 50 Sources: lab_blog/OpenAI

Codex now includes pay-as-you-go pricing for ChatGPT Business and Enterprise, providing teams a more flexible option to start and scale adoption.

Developer Tools

🟡 ViGoR-Bench: How Far Are Visual Generative Models From Zero-Shot Visual Reasoners? — score 55 Sources: huggingface

Beneath the stunning visual fidelity of modern AIGC models lies a "logical desert", where systems fail tasks that require physical, causal, or complex spatial reasoning. Current evaluations largely rely on superficial metrics or fragmented benchmarks, creating a "performance mirage" that overlooks…

🟡 Gemma 4: Byte for byte, the most capable open models — score 50 Sources: lab_blog/DeepMind

Gemma 4: Our most intelligent open models to date, purpose-built for advanced reasoning and agentic workflows.

🟡 Vision2Web: A Hierarchical Benchmark for Visual Website Development with Agent Verification — score 45 Sources: huggingface

Recent advances in large language models have improved the capabilities of coding agents, yet systematic evaluation of complex, end-to-end website development remains limited. To address this gap, we introduce Vision2Web, a hierarchical benchmark for visual website development, spanning from static…

Other Signals

🟡 OpenAI acquires TBPN — score 50 Sources: lab_blog/OpenAI

OpenAI acquires TBPN to accelerate global conversations around AI and support independent media, expanding dialogue with builders, businesses, and the broader tech community.

🟢 Incremental

Model Releases

🟢 QuitoBench: A High-Quality Open Time Series Forecasting Benchmark — score 25 Sources: huggingface

Time series forecasting is critical across finance, healthcare, and cloud computing, yet progress is constrained by a fundamental bottleneck: the scarcity of large-scale, high-quality benchmarks. To address this gap, we introduce QuitoBench, a regime-balanced benchmark for time series forecasting wi…

🟢 GaussianGPT: Towards Autoregressive 3D Gaussian Scene Generation — score 5 Sources: huggingface

Most recent advances in 3D generative modeling rely on diffusion or flow-matching formulations. We instead explore a fully autoregressive alternative and introduce GaussianGPT, a transformer-based model that directly generates 3D Gaussians via next-token prediction, thus facilitating full 3D scene g…

Developer Tools

🟢 Reasoning Shift: How Context Silently Shortens LLM Reasoning — score 35 Sources: huggingface

Large language models (LLMs) exhibiting test-time scaling behavior, such as extended reasoning traces and self-verification, have demonstrated remarkable performance on complex, long-term reasoning tasks. However, the robustness of these reasoning behaviors remains underexplored. To investigate this…

🟢 HippoCamp: Benchmarking Contextual Agents on Personal Computers — score 15 Sources: huggingface

We present HippoCamp, a new benchmark designed to evaluate agents' capabilities on multimodal file management. Unlike existing agent benchmarks that focus on tasks like web interaction, tool use, or software automation in generic settings, HippoCamp evaluates agents in user-centric environments to m…

📄 New Papers

Title	Category	Score
ClawKeeper: Comprehensive Safety Protection for OpenClaw Agents Through Skills, Plugins, and Watchers	model_release	187
Terminal Agents Suffice for Enterprise Automation	model_release	103
MiroEval: Benchmarking Multimodal Deep Research Agents in Process and Outcome	developer_tool	75
Embarrassingly Simple Self-Distillation Improves Code Generation	model_release	54
ViGoR-Bench: How Far Are Visual Generative Models From Zero-Shot Visual Reasoners?	developer_tool	46
Magic, Madness, Heaven, Sin: LLM Output Diversity is Everything, Everywhere, All at Once	cs.AI	0
ToolMisuseBench: An Offline Deterministic Benchmark for Tool Misuse and Recovery in Agentic Systems	cs.AI	0
LLM Agents as Social Scientists: A Human-AI Collaborative Platform for Social Science Automation	cs.AI	0
ProdCodeBench: A Production-Derived Benchmark for Evaluating AI Coding Agents	cs.AI	0
A Role-Based LLM Framework for Structured Information Extraction from Healthy Food Policies	cs.AI	0
PHMForge: A Scenario-Driven Agentic Benchmark for Industrial Asset Lifecycle Maintenance	cs.AI	0
Countering Catastrophic Forgetting of Large Language Models for Better Instruction Following via Weight-Space Model Merging	cs.AI	0
What Is Actually Being Annotated? Inter-Prompt Reliability as a Measurement Problem in LLM-Based Social Science Labeling	cs.AI	0
RAE-AR: Taming Autoregressive Models with Representation Autoencoders	cs.AI	0
Automating Database-Native Function Code Synthesis with LLMs	cs.AI	0

🏢 Lab Blog Posts

OpenAI: OpenAI acquires TBPN
OpenAI: Codex now offers more flexible pricing for teams
DeepMind: Gemma 4: Byte for byte, the most capable open models

AI Watchtower Briefing — 2026-04-02
