AI Watchtower

🔴 High Significance

Model Releases

🔴 GrandCode: Achieving Grandmaster Level in Competitive Programming via Agentic Reinforcement Learning — score 95 Sources: huggingface

Competitive programming remains one of the last few human strongholds in coding against AI. The best AI system to date still underperforms the best humans in competitive programming: the most recent best result, Google's Gemini 3 Deep Think, attained 8th place even though it was not evaluated under live compet

Developer Tools

🔴 InCoder-32B-Thinking: Industrial Code World Model for Thinking — score 85 Sources: huggingface

Industrial software development across chip design, GPU optimization, and embedded systems lacks expert reasoning traces showing how engineers reason about hardware constraints and timing semantics. In this work, we propose InCoder-32B-Thinking, trained on the data from the Error-driven Chain-of-Tho

🔴 Self-Distilled RLVR — score 75 Sources: huggingface

On-policy distillation (OPD) has become a popular training paradigm in the LLM community. This paradigm selects a larger model as the teacher to provide dense, fine-grained signals for each sampled trajectory, in contrast to reinforcement learning with verifiable rewards (RLVR), which only obtains s

🟡 Notable

Model Releases

🟡 Agentic-MME: What Agentic Capability Really Brings to Multimodal Intelligence? — score 55 Sources: huggingface

Multimodal Large Language Models (MLLMs) are evolving from passive observers into active agents, solving problems through Visual Expansion (invoking visual tools) and Knowledge Expansion (open-web search). However, existing evaluations fall short: they lack flexible tool integration, test visual and

Developer Tools

🟡 A Simple Baseline for Streaming Video Understanding — score 65 Sources: huggingface

Recent streaming video understanding methods increasingly rely on complex memory mechanisms to handle long video streams. We challenge this trend with a simple finding: a sliding-window baseline that feeds only the most recent N frames to an off-the-shelf VLM already matches or surpasses published s
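The sliding-window idea described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `sliding_window_answers`, `query_vlm`, and `window_size` are hypothetical names, and `query_vlm` stands in for whatever off-the-shelf VLM call is used.

```python
from collections import deque
from typing import Callable, Deque, Iterable, Iterator, List, TypeVar

Frame = TypeVar("Frame")
Answer = TypeVar("Answer")


def sliding_window_answers(
    frames: Iterable[Frame],
    query_vlm: Callable[[List[Frame]], Answer],  # hypothetical stand-in for a VLM call
    window_size: int,
) -> Iterator[Answer]:
    """Query the model on only the most recent `window_size` frames of a stream."""
    if window_size < 1:
        raise ValueError("window_size must be at least 1")
    # A bounded deque drops the oldest frame automatically once it is full,
    # so no separate memory mechanism is needed.
    window: Deque[Frame] = deque(maxlen=window_size)
    for frame in frames:
        window.append(frame)
        yield query_vlm(list(window))


if __name__ == "__main__":
    # Toy stream of integer "frames"; the fake model just echoes its input.
    for seen in sliding_window_answers(range(5), lambda w: w, window_size=3):
        print(seen)
```

In practice the model would usually be queried only when a question arrives rather than on every frame; the per-frame loop here is kept for brevity.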

🟡 Industrial policy for the Intelligence Age — score 50 Sources: lab_blog/OpenAI

Explore our ambitious, people-first industrial policy ideas for the AI era—focused on expanding opportunity, sharing prosperity, and building resilient institutions as advanced intelligence evolves.

Other Signals

🟡 Announcing the OpenAI Safety Fellowship — score 50 Sources: lab_blog/OpenAI

A pilot program to support independent safety and alignment research and develop the next generation of talent.

🟡 Token Warping Helps MLLMs Look from Nearby Viewpoints — score 45 Sources: huggingface

Can warping tokens, rather than pixels, help multimodal large language models (MLLMs) understand how a scene appears from a nearby viewpoint? While MLLMs perform well on visual reasoning, they remain fragile to viewpoint changes, as pixel-wise warping is highly sensitive to small depth errors and of

🟢 Incremental

Model Releases

🟢 Communicating about Space: Language-Mediated Spatial Integration Across Partial Views — score 25 Sources: huggingface

Humans build shared spatial understanding by communicating partial, viewpoint-dependent observations. We ask whether Multimodal Large Language Models (MLLMs) can do the same, aligning distinct egocentric views through dialogue to form a coherent, allocentric mental model of a shared environment. To

🟢 AgentHazard: A Benchmark for Evaluating Harmful Behavior in Computer-Use Agents — score 15 Sources: huggingface

Computer-use agents extend language models from text generation to persistent action over tools, files, and execution environments. Unlike chat systems, they maintain state across interactions and translate intermediate outputs into concrete actions. This creates a distinct safety challenge in that

Developer Tools

🟢 Test-Time Scaling Makes Overtraining Compute-Optimal — score 35 Sources: huggingface

Modern LLMs scale at test-time, e.g. via repeated sampling, where inference cost grows with model size and the number of samples. This creates a trade-off that pretraining scaling laws, such as Chinchilla, do not address. We present Train-to-Test (T^2) scaling laws that jointly optimize model size,
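The trade-off mentioned above can be made concrete with a rough cost model. The sketch below is not the paper's T^2 law; it assumes the common rule of thumb of about 2 FLOPs per parameter per generated token, and the function names and model sizes are hypothetical.

```python
def inference_flops(n_params: int, tokens_per_sample: int, num_samples: int) -> int:
    """Approximate inference cost, assuming ~2 FLOPs per parameter per token."""
    return 2 * n_params * tokens_per_sample * num_samples


def samples_within_budget(budget_flops: int, n_params: int, tokens_per_sample: int) -> int:
    """How many samples a model of `n_params` can draw for a fixed FLOP budget."""
    return budget_flops // (2 * n_params * tokens_per_sample)


if __name__ == "__main__":
    # Budget: one 1,000-token sample from a hypothetical 70B-parameter model.
    budget = inference_flops(70_000_000_000, 1_000, 1)
    # A hypothetical 7B-parameter model can draw ten samples for the same cost.
    print(samples_within_budget(budget, 7_000_000_000, 1_000))
```

Under this approximation, cost is linear in both model size and sample count, so a model ten times smaller can be sampled ten times as often for the same inference budget.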

🟢 Xpertbench: Expert Level Tasks with Rubrics-Based Evaluation — score 5 Sources: huggingface

As Large Language Models (LLMs) exhibit plateauing performance on conventional benchmarks, a pivotal challenge persists: evaluating their proficiency in complex, open-ended tasks characterizing genuine expert-level cognition. Existing frameworks suffer from narrow domain coverage, reliance on genera

📄 New Papers

Title	Category	Score
GrandCode: Achieving Grandmaster Level in Competitive Programming via Agentic Reinforcement Learning	model_release	378
InCoder-32B-Thinking: Industrial Code World Model for Thinking	developer_tool	236
Self-Distilled RLVR	developer_tool	174
A Simple Baseline for Streaming Video Understanding	developer_tool	80
Agentic-MME: What Agentic Capability Really Brings to Multimodal Intelligence?	model_release	40
RESCORE: LLM-Driven Simulation Recovery in Control Systems Research Papers	cs.AI	0
Soft Tournament Equilibrium	cs.AI	0
GA-GS: Generation-Assisted Gaussian Splatting for Static Scene Reconstruction	cs.AI	0
Boosted Distributional Reinforcement Learning: Analysis and Healthcare Applications	cs.AI	0
Thermodynamic-Inspired Explainable GeoAI: Uncovering Regime-Dependent Mechanisms in Heterogeneous Spatial Systems	cs.AI	0
Implementing surrogate goals for safer bargaining in LLM-based agents	cs.AI	0
Domain-Contextualized Inference: A Computable Graph Architecture for Explicit-Domain Reasoning	cs.AI	0
RoboPhD: Evolving Diverse Complex Agents Under Tight Evaluation Budgets	cs.AI	0
ReFinE: Streamlining UI Mockup Iteration with Research Findings	cs.AI	0
REAM: Merging Improves Pruning of Experts in LLMs	cs.AI	0

🏢 Lab Blog Posts

OpenAI: Announcing the OpenAI Safety Fellowship
OpenAI: Industrial policy for the Intelligence Age

AI Watchtower Briefing — 2026-04-06