AW · AI Watchtower

🔴 High Significance

Model Releases

🔴 MOOSE-Star: Unlocking Tractable Training for Scientific Discovery by Breaking the Complexity Barrier — score 85 Sources: huggingface

While large language models (LLMs) show promise in scientific discovery, existing research focuses on inference or feedback-driven training, leaving the direct modeling of the generative reasoning process, P(hypothesis|background) (P(h|b)), unexplored. We demonstrate that directly training P(h|b) is

Developer Tools

🔴 SkillNet: Create, Evaluate, and Connect AI Skills — score 95 Sources: huggingface

Current AI agents can flexibly invoke tools and execute complex tasks, yet their long-term advancement is hindered by the lack of systematic accumulation and transfer of skills. Without a unified mechanism for skill consolidation, agents frequently "reinvent the wheel", rediscovering solutions in

🔴 DARE: Aligning LLM Agents with the R Statistical Ecosystem via Distribution-Aware Retrieval — score 75 Sources: huggingface

Large Language Model (LLM) agents can automate data-science workflows, but many rigorous statistical methods implemented in R remain underused because LLMs struggle with statistical knowledge and tool retrieval. Existing retrieval-augmented approaches focus on function-level semantics and ignore dat

🟡 Notable

Model Releases

🟡 AgentVista: Evaluating Multimodal Agents in Ultra-Challenging Realistic Visual Scenarios — score 65 Sources: huggingface

Real-world multimodal agents solve multi-step workflows grounded in visual evidence. For example, an agent can troubleshoot a device by linking a wiring photo to a schematic and validating the fix with online documentation, or plan a trip by interpreting a transit map and checking schedules under ro

Developer Tools

🟡 RoboPocket: Improve Robot Policies Instantly with Your Phone — score 55 Sources: huggingface

Scaling imitation learning is fundamentally constrained by the efficiency of data collection. While handheld interfaces have emerged as a scalable solution for in-the-wild data acquisition, they predominantly operate in an open-loop manner: operators blindly collect demonstrations without knowing th

🟡 Codex Security: now in research preview — score 50 Sources: lab_blog/OpenAI

Codex Security is an AI application security agent that analyzes project context to detect, validate, and patch complex vulnerabilities with higher confidence and less noise.

🟡 How Descript engineers multilingual video dubbing at scale — score 50 Sources: lab_blog/OpenAI

Using OpenAI reasoning models, Descript unlocked automatic localization of large content libraries without losing timing or meaning.

🟡 How Balyasny Asset Management built an AI research engine — score 50 Sources: lab_blog/OpenAI

By combining rigorous model evaluation, full-platform use of OpenAI, and agent workflows, Balyasny is reinventing investment research.

🟡 MASQuant: Modality-Aware Smoothing Quantization for Multimodal Large Language Models — score 45 Sources: huggingface

Post-training quantization (PTQ) with computational invariance for Large Language Models (LLMs) has demonstrated remarkable advances; however, its application to Multimodal Large Language Models (MLLMs) presents substantial challenges. In this paper, we analyze SmoothQuant as a case study and ide

🟢 Incremental

Model Releases

🟢 Timer-S1: A Billion-Scale Time Series Foundation Model with Serial Scaling — score 5 Sources: huggingface

We introduce Timer-S1, a strong Mixture-of-Experts (MoE) time series foundation model with 8.3B total parameters, 0.75B activated parameters for each token, and a context length of 11.5K. To overcome the scalability bottleneck in existing pre-trained time series foundation models, we perform Serial

Developer Tools

🟢 HiFi-Inpaint: Towards High-Fidelity Reference-Based Inpainting for Generating Detail-Preserving Human-Product Images — score 35 Sources: huggingface

Human-product images, which showcase the integration of humans and products, play a vital role in advertising, e-commerce, and digital marketing. The essential challenge of generating such images lies in ensuring the high-fidelity preservation of product details. Among existing paradigms, reference-

🟢 Interactive Benchmarks — score 25 Sources: huggingface

Standard benchmarks have become increasingly unreliable due to saturation, subjectivity, and poor generalization. We argue that evaluating a model's ability to actively acquire information is important for assessing its intelligence. We propose Interactive Benchmarks, a unified evaluation paradigm tha

🟢 Large Multimodal Models as General In-Context Classifiers — score 15 Sources: huggingface

Which multimodal model should we use for classification? Previous studies suggest that the answer lies in CLIP-like contrastive Vision-Language Models (VLMs), due to their remarkable performance in zero-shot classification. In contrast, Large Multimodal Models (LMMs) are more suitable for complex tas

📄 New Papers

Title	Category	Score	Link
SkillNet: Create, Evaluate, and Connect AI Skills	developer_tool	99	Open
MOOSE-Star: Unlocking Tractable Training for Scientific Discovery by Breaking the Complexity Barrier	model_release	95	Open
DARE: Aligning LLM Agents with the R Statistical Ecosystem via Distribution-Aware Retrieval	developer_tool	56	Open
AgentVista: Evaluating Multimodal Agents in Ultra-Challenging Realistic Visual Scenarios	model_release	47	Open
RoboPocket: Improve Robot Policies Instantly with Your Phone	developer_tool	39	Open
Bridging Domains through Subspace-Aware Model Merging	cs.AI	0	Open
Depth Charge: Jailbreak Large Language Models from Deep Safety Attention Heads	cs.AI	0	Open
Knowing without Acting: The Disentangled Geometry of Safety Mechanisms in Large Language Models	cs.AI	0	Open
PVminerLLM: Structured Extraction of Patient Voice from Patient-Generated Text using Large Language Models	cs.AI	0	Open
Balancing Domestic and Global Perspectives: Evaluating Dual-Calibration and LLM-Generated Nudges for Diverse News Recommendation	cs.AI	0	Open
Visual Words Meet BM25: Sparse Auto-Encoder Visual Word Scoring for Image Retrieval	cs.AI	0	Open
Proof-of-Guardrail in AI Agents and What (Not) to Trust from It	cs.AI	0	Open
ProtAlign: Contrastive learning paradigm for Sequence and structure alignment	cs.AI	0	Open
AWPD: Frequency Shield Network for Agnostic Watermark Presence Detection	cs.AI	0	Open
Bi Directional Feedback Fusion for Activity Aware Forecasting of Indoor CO2 and PM2.5	cs.AI	0	Open

🏢 Lab Blog Posts

OpenAI: Codex Security: now in research preview
OpenAI: How Descript engineers multilingual video dubbing at scale
OpenAI: How Balyasny Asset Management built an AI research engine

AI Watchtower Briefing — 2026-03-06
