AI Watchtower

🔴 High Significance

Model Releases

🔴 Golden Goose: A Simple Trick to Synthesize Unlimited RLVR Tasks from Unverifiable Internet Text — score 85 Sources: huggingface

Reinforcement Learning with Verifiable Rewards (RLVR) has become a cornerstone for unlocking complex reasoning in Large Language Models (LLMs). Yet, scaling up RL is bottlenecked by limited existing verifiable data, where improvements increasingly saturate over prolonged training. To overcome this,

🔴 ASTRA: Automated Synthesis of agentic Trajectories and Reinforcement Arenas — score 70 Sources: huggingface

Large language models (LLMs) are increasingly used as tool-augmented agents for multi-step decision making, yet training robust tool-using agents remains challenging. Existing methods still require manual intervention, depend on non-verifiable simulated environments, rely exclusively on either super

Developer Tools

🔴 PaperBanana: Automating Academic Illustration for AI Scientists — score 95 Sources: huggingface

Despite rapid advances in autonomous AI scientists powered by language models, generating publication-ready illustrations remains a labor-intensive bottleneck in the research workflow. To lift this burden, we introduce PaperBanana, an agentic framework for automated generation of publication-ready a

🔴 Quartet II: Accurate LLM Pre-Training in NVFP4 by Improved Unbiased Gradient Estimation — score 70 Sources: huggingface

The NVFP4 lower-precision format, supported in hardware by NVIDIA Blackwell GPUs, promises to allow, for the first time, end-to-end fully-quantized pre-training of massive models such as LLMs. Yet, existing quantized training methods still sacrifice some of the representation capacity of this format

🟡 Notable

Model Releases

🟡 THINKSAFE: Self-Generated Safety Alignment for Reasoning Models — score 55 Sources: huggingface

Large reasoning models (LRMs) achieve remarkable performance by leveraging reinforcement learning (RL) on reasoning tasks to generate long chain-of-thought (CoT) reasoning. However, this over-optimization often prioritizes compliance, making models vulnerable to harmful prompts. To mitigate this saf

🟡 Introducing the Codex app — score 50 Sources: lab_blog/OpenAI

Introducing the Codex app for macOS—a command center for AI coding and software development with multiple agents, parallel workflows, and long-running tasks.

Developer Tools

🟡 Snowflake and OpenAI partner to bring frontier intelligence to enterprise data — score 50 Sources: lab_blog/OpenAI

OpenAI and Snowflake partner in a $200M agreement to bring frontier intelligence into enterprise data, enabling AI agents and insights directly in Snowflake.

🟡 ReGuLaR: Variational Latent Reasoning Guided by Rendered Chain-of-Thought — score 40 Sources: huggingface

While Chain-of-Thought (CoT) significantly enhances the performance of Large Language Models (LLMs), explicit reasoning chains introduce substantial computational redundancy. Recent latent reasoning methods attempt to mitigate this by compressing reasoning processes into latent space, but often suff

🟡 TTCS: Test-Time Curriculum Synthesis for Self-Evolving — score 40 Sources: huggingface

Test-Time Training offers a promising way to improve the reasoning ability of large language models (LLMs) by adapting the model using only the test questions. However, existing methods struggle with difficult reasoning problems for two reasons: raw test questions are often too difficult to yield hi

🟢 Incremental

Developer Tools

🟢 Causal World Modeling for Robot Control — score 25 Sources: huggingface

This work highlights that video world modeling, alongside vision-language pre-training, establishes a fresh and independent foundation for robot learning. Intuitively, video world models provide the ability to imagine the near future by understanding the causality between actions and visual dynamics

🟢 Do Reasoning Models Enhance Embedding Models? — score 15 Sources: huggingface

State-of-the-art embedding models are increasingly derived from decoder-only Large Language Model (LLM) backbones adapted via contrastive learning. Given the emergence of reasoning models trained via Reinforcement Learning with Verifiable Rewards (RLVR), a natural question arises: do enhanced reason

🟢 MemOCR: Layout-Aware Visual Memory for Efficient Long-Horizon Reasoning — score 5 Sources: huggingface

Long-horizon agentic reasoning necessitates effectively compressing growing interaction histories into a limited context window. Most existing memory systems serialize history as text, where token-level cost is uniform and scales linearly with length, often spending scarce budget on low-value detail

📄 New Papers

Title	Category	Score
PaperBanana: Automating Academic Illustration for AI Scientists	developer_tool	240
Golden Goose: A Simple Trick to Synthesize Unlimited RLVR Tasks from Unverifiable Internet Text	model_release	117
Quartet II: Accurate LLM Pre-Training in NVFP4 by Improved Unbiased Gradient Estimation	developer_tool	65
ASTRA: Automated Synthesis of agentic Trajectories and Reinforcement Arenas	model_release	65
THINKSAFE: Self-Generated Safety Alignment for Reasoning Models	model_release	42
OpInf-LLM: Parametric PDE Solving with LLMs via Operator Inference	cs.AI	0
Draw2Learn: A Human-AI Collaborative Tool for Drawing-Based Science Learning	cs.AI	0
Governance at the Edge of Architecture: Regulating NeuroAI and Neuromorphic Systems	cs.AI	0
Harnessing Flexible Spatial and Temporal Data Center Workloads for Grid Regulation Services	cs.AI	0
MarkCleaner: High-Fidelity Watermark Removal via Imperceptible Micro-Geometric Perturbation	cs.AI	0
White-Box Neural Ensemble for Vehicular Plasticity: Quantifying the Efficiency Cost of Symbolic Auditability in Adaptive NMPC	cs.AI	0
Qrita: High-performance Top-k and Top-p Algorithm for GPUs using Pivot-based Truncation and Selection	cs.AI	0
You Need an Encoder for Native Position-Independent Caching	cs.AI	0
A Relative-Budget Theory for Reinforcement Learning with Verifiable Rewards in Large Language Model Reasoning	cs.AI	0
Toward a Machine Bertin: Why Visualization Needs Design Principles for Machine Cognition	cs.AI	0

🏢 Lab Blog Posts

OpenAI: Snowflake and OpenAI partner to bring frontier intelligence to enterprise data
OpenAI: Introducing the Codex app

AI Watchtower Briefing — 2026-02-02
