AW · AI Watchtower

🔴 High Significance

Developer Tools

🔴 ClawGUI: A Unified Framework for Training, Evaluating, and Deploying GUI Agents — score 95 Sources: huggingface

GUI agents drive applications through their visual interfaces instead of programmatic APIs, interacting with arbitrary software via taps, swipes, and keystrokes, reaching a long tail of applications that CLI-based agents cannot. Yet progress in this area is bottlenecked less by modeling capacity tha

🔴 KnowRL: Boosting LLM Reasoning via Reinforcement Learning with Minimal-Sufficient Knowledge Guidance — score 85 Sources: huggingface

RLVR improves reasoning in large language models, but its effectiveness is often limited by severe reward sparsity on hard problems. Recent hint-based RL methods mitigate sparsity by injecting partial solutions or abstract templates, yet they typically scale guidance by adding more tokens, which int
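
As a rough illustration of the reward sparsity mentioned above (this is not KnowRL's method): assuming a binary verifiable reward and independent rollouts, the chance that a batch contains no rewarded sample at all is (1 - p)^k, where p is a hypothetical per-sample success probability and k is the number of rollouts. The function name and the numbers are illustrative assumptions, not values from the paper.

```python
def prob_no_reward(p_success: float, num_rollouts: int) -> float:
    """Probability that none of `num_rollouts` independent samples earns a binary reward.

    Assumes each rollout succeeds independently with probability `p_success`
    (a simplifying assumption made only for this sketch).
    """
    if not 0.0 <= p_success <= 1.0:
        raise ValueError("p_success must be in [0, 1]")
    if num_rollouts < 0:
        raise ValueError("num_rollouts must be non-negative")
    return (1.0 - p_success) ** num_rollouts


if __name__ == "__main__":
    # Hypothetical success rates for an easy, a hard, and a very hard problem.
    for p in (0.5, 0.05, 0.005):
        print(f"p={p}: P(all 8 rollouts fail) = {prob_no_reward(p, 8):.3f}")
```

When every rollout in a batch fails, a purely outcome-based reward gives the policy nothing to distinguish one attempt from another, which is the gap hint-based methods try to close.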

🔴 Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe — score 75 Sources: huggingface

On-policy distillation (OPD) has become a core technique in the post-training of large language models, yet its training dynamics remain poorly understood. This paper provides a systematic investigation of OPD dynamics and mechanisms. We first identify that two conditions govern whether OPD succeeds

🟡 Notable

Model Releases

🟡 Nemotron 3 Super: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning — score 55 Sources: huggingface

We describe the pre-training, post-training, and quantization of Nemotron 3 Super, a 120 billion (active 12 billion) parameter hybrid Mamba-Attention Mixture-of-Experts model. Nemotron 3 Super is the first model in the Nemotron 3 family to 1) be pre-trained in NVFP4, 2) leverage LatentMoE, a new Mix

🟡 Gemini 3.1 Flash TTS: the next generation of expressive AI speech — score 50 Sources: lab_blog/DeepMind

Our newest audio model introduces granular audio tags that give you precise control to direct AI speech for expressive audio generation.

Developer Tools

🟡 Lyra 2.0: Explorable Generative 3D Worlds — score 65 Sources: huggingface

Recent advances in video generation enable a new paradigm for 3D scene creation: generating camera-controlled videos that simulate scene walkthroughs, then lifting them to 3D via feed-forward reconstruction techniques. This generative reconstruction approach combines the visual fidelity and creative

🟡 The next evolution of the Agents SDK — score 50 Sources: lab_blog/OpenAI

OpenAI updates the Agents SDK with native sandbox execution and a model-native harness, helping developers build secure, long-running agents across files and tools.

🟡 Toward Autonomous Long-Horizon Engineering for ML Research — score 45 Sources: huggingface

Autonomous AI research has advanced rapidly, but long-horizon ML research engineering remains difficult: agents must sustain coherent progress across task comprehension, environment setup, implementation, experimentation, and debugging over hours or days. We introduce AiScientist, a system for auton

🟢 Incremental

Model Releases

🟢 BERT-as-a-Judge: A Robust Alternative to Lexical Methods for Efficient Reference-Based LLM Evaluation — score 25 Sources: huggingface

Accurate evaluation is central to the large language model (LLM) ecosystem, guiding model selection and downstream adoption across diverse use cases. In practice, however, evaluating generative outputs typically relies on rigid lexical methods to extract and assess answers, which can conflate a mode

🟢 The Blind Spot of Agent Safety: How Benign User Instructions Expose Critical Vulnerabilities in Computer-Use Agents — score 5 Sources: huggingface

Computer-use agents (CUAs) can now autonomously complete complex tasks in real digital environments, but when misled, they can also be used to automate harmful actions programmatically. Existing safety evaluations largely target explicit threats such as misuse and prompt injection, but overlook a su

Developer Tools

🟢 Turing Test on Screen: A Benchmark for Mobile GUI Agent Humanization — score 25 Sources: huggingface

The rise of autonomous GUI agents has triggered adversarial countermeasures from digital platforms, yet existing research prioritizes utility and robustness over the critical dimension of anti-detection. We argue that for agents to survive in human-centric ecosystems, they must evolve Humanization c

🟢 SPPO: Sequence-Level PPO for Long-Horizon Reasoning Tasks — score 25 Sources: huggingface

Proximal Policy Optimization (PPO) is central to aligning Large Language Models (LLMs) in reasoning tasks with verifiable rewards. However, standard token-level PPO struggles in this setting due to the instability of temporal credit assignment over long Chain-of-Thought (CoT) horizons and the prohib
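
For orientation, the standard token-level PPO clipped surrogate that the abstract refers to can be sketched as below. This is the textbook objective, not SPPO's sequence-level variant, which the truncated abstract does not describe. The function names, the clip range of 0.2, and the sample inputs are illustrative assumptions.

```python
import math


def ppo_clipped_term(logp_new: float, logp_old: float,
                     advantage: float, clip_eps: float = 0.2) -> float:
    """Clipped surrogate for a single token: min(r * A, clip(r, 1-eps, 1+eps) * A)."""
    # Probability ratio between the current policy and the policy that sampled the token.
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


def ppo_token_level_objective(logps_new, logps_old, advantages,
                              clip_eps: float = 0.2) -> float:
    """Mean clipped surrogate over all tokens of a sampled response (to be maximized)."""
    terms = [
        ppo_clipped_term(new, old, adv, clip_eps)
        for new, old, adv in zip(logps_new, logps_old, advantages)
    ]
    if not terms:
        raise ValueError("need at least one token")
    return sum(terms) / len(terms)


if __name__ == "__main__":
    # Hypothetical per-token log-probabilities and advantage estimates.
    print(ppo_token_level_objective([-1.0, -0.5], [-1.2, -0.4], [0.8, -0.3]))
```

In this formulation every token needs its own advantage estimate, which is where credit assignment over a long chain of thought becomes difficult.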

📄 New Papers

Title	Category	Score	Link
ClawGUI: A Unified Framework for Training, Evaluating, and Deploying GUI Agents	developer_tool	149	Open
KnowRL: Boosting LLM Reasoning via Reinforcement Learning with Minimal-Sufficient Knowledge Guidance	developer_tool	103	Open
Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe	developer_tool	94	Open
Lyra 2.0: Explorable Generative 3D Worlds	developer_tool	43	Open
Nemotron 3 Super: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning	model_release	37	Open
A 3D SAM-Based Progressive Prompting Framework for Multi-Task Segmentation of Radiotherapy-induced Normal Tissue Injuries in Limited-Data Settings	cs.AI	0	Open
Young people's perceptions and recommendations for conversational generative artificial intelligence in youth mental health	cs.AI	0	Open
On the Use of Evolutionary Optimization for the Dynamic Chance Constrained Open-Pit Mine Scheduling Problem	cs.AI	0	Open
ReSS: Learning Reasoning Models for Tabular Data Prediction via Symbolic Scaffold	cs.AI	0	Open
Quantifying and Understanding Uncertainty in Large Reasoning Models	cs.AI	0	Open
From Prediction to Justification: Aligning Sentiment Reasoning with Human Rationale via Reinforcement Learning	cs.AI	0	Open
Minimax Optimality and Spectral Routing for Majority-Vote Ensembles under Markov Dependence	cs.AI	0	Open
DF3DV-1K: A Large-Scale Dataset and Benchmark for Distractor-Free Novel View Synthesis	cs.AI	0	Open
The Cognitive Circuit Breaker: A Systems Engineering Framework for Intrinsic AI Reliability	cs.AI	0	Open
MERRIN: A Benchmark for Multimodal Evidence Retrieval and Reasoning in Noisy Web Environments	cs.AI	0	Open

🏢 Lab Blog Posts

OpenAI: The next evolution of the Agents SDK
DeepMind: Gemini 3.1 Flash TTS: the next generation of expressive AI speech

AI Watchtower Briefing — 2026-04-15
