AI Watchtower

🔴 High Significance

Model Releases

🔴 Geometry-Guided Reinforcement Learning for Multi-view Consistent 3D Scene Editing — score 95 Sources: huggingface

Leveraging the priors of 2D diffusion models for 3D editing has emerged as a promising paradigm. However, maintaining multi-view consistency in edited results remains challenging, and the extreme scarcity of 3D-consistent editing paired data renders supervised fine-tuning (SFT), the most effective t

Developer Tools

🔴 Thinking to Recall: How Reasoning Unlocks Parametric Knowledge in LLMs — score 85 Sources: huggingface

While reasoning in LLMs plays a natural role in math, code generation, and multi-hop factual questions, its effect on simple, single-hop factual questions remains unclear. Such questions do not require step-by-step logical decomposition, making the utility of reasoning highly counterintuitive. Never

🔴 MM-Zero: Self-Evolving Multi-Model Vision Language Models From Zero Data — score 75 Sources: huggingface

Self-evolving has emerged as a key paradigm for improving foundational models such as Large Language Models (LLMs) and Vision Language Models (VLMs) with minimal human intervention. While recent approaches have demonstrated that LLM agents can self-evolve from scratch with little to no data, VLMs in

🟡 Notable

Model Releases

🟡 Designing AI agents to resist prompt injection — score 50 Sources: lab_blog/OpenAI

How ChatGPT defends against prompt injection and social engineering by constraining risky actions and protecting sensitive data in agent workflows.

🟡 Fish Audio S2 Technical Report — score 45 Sources: huggingface

We introduce Fish Audio S2, an open-sourced text-to-speech system featuring multi-speaker, multi-turn generation, and, most importantly, instruction-following control via natural-language descriptions. To scale training, we develop a multi-stage training recipe together with a staged data pipeline c

Developer Tools

🟡 Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion — score 65 Sources: huggingface

While recent multimodal large language models (MLLMs) have made impressive strides, they predominantly employ a conventional autoregressive architecture as their backbone, leaving significant room to explore effective and efficient alternatives in architectural design. Concurrently, recent studies h

🟡 InternVL-U: Democratizing Unified Multimodal Models for Understanding, Reasoning, Generation and Editing — score 55 Sources: huggingface

Unified multimodal models (UMMs) that integrate understanding, reasoning, generation, and editing face inherent trade-offs between maintaining strong semantic comprehension and acquiring powerful generation capabilities. In this report, we present InternVL-U, a lightweight 4B-parameter UMM that demo

🟡 From model to agent: Equipping the Responses API with a computer environment — score 50 Sources: lab_blog/OpenAI

How OpenAI built an agent runtime using the Responses API, shell tool, and hosted containers to run secure, scalable agents with files, tools, and state.

🟡 Rakuten fixes issues twice as fast with Codex — score 50 Sources: lab_blog/OpenAI

Business & Funding

🟡 Wayfair boosts catalog accuracy and support speed with OpenAI — score 50 Sources: lab_blog/OpenAI

Wayfair uses OpenAI models to improve ecommerce support and product catalog accuracy, automating ticket triage and enhancing millions of product attributes at scale.

🟢 Incremental

Model Releases

🟢 Stepping VLMs onto the Court: Benchmarking Spatial Intelligence in Sports — score 35 Sources: huggingface

Sports have long attracted broad attention as they push the limits of human physical and cognitive capabilities. Amid growing interest in spatial intelligence for vision-language models (VLMs), sports provide a natural testbed for understanding high-intensity human motion and dynamic object interact

🟢 Reading, Not Thinking: Understanding and Bridging the Modality Gap When Text Becomes Pixels in Multimodal LLMs — score 25 Sources: huggingface

Multimodal large language models (MLLMs) can process text presented as images, yet they often perform worse than when the same content is provided as textual tokens. We systematically diagnose this "modality gap" by evaluating seven MLLMs across seven benchmarks in five input modes, spanning both sy

🟢 Are Audio-Language Models Listening? Audio-Specialist Heads for Adaptive Audio Steering — score 5 Sources: huggingface

Multimodal large language models can exhibit text dominance, over-relying on linguistic priors instead of grounding predictions in non-text inputs. One example is large audio-language models (LALMs) where decisive audio evidence can be under-utilized even when it contains important information. To a

Developer Tools

🟢 MiniAppBench: Evaluating the Shift from Text to Interactive HTML Responses in LLM-Powered Assistants — score 15 Sources: huggingface

With the rapid advancement of Large Language Models (LLMs) in code generation, human-AI interaction is evolving from static text responses to dynamic, interactive HTML-based applications, which we term MiniApps. These applications require models to not only render visual interfaces but also construc

📄 New Papers

Title	Category	Score
Geometry-Guided Reinforcement Learning for Multi-view Consistent 3D Scene Editing	model_release	155
Thinking to Recall: How Reasoning Unlocks Parametric Knowledge in LLMs	developer_tool	79
MM-Zero: Self-Evolving Multi-Model Vision Language Models From Zero Data	developer_tool	56
Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion	developer_tool	55
InternVL-U: Democratizing Unified Multimodal Models for Understanding, Reasoning, Generation and Editing	developer_tool	53
Conversational AI-Enhanced Exploration System to Query Large-Scale Digitised Collections of Natural History Museums	cs.AI	0
Quantum entanglement provides a competitive advantage in adversarial games	cs.AI	0
Hybrid Self-evolving Structured Memory for GUI Agents	cs.AI	0
Simulation-in-the-Reasoning (SiR): A Conceptual Framework for Empirically Grounded AI in Autonomous Transportation	cs.AI	0
Is this Idea Novel? An Automated Benchmark for Judgment of Research Ideas	cs.AI	0
The Chronicles of RiDiC: Generating Datasets with Controlled Popularity Distribution for Long-form Factuality Evaluation	cs.AI	0
NasoVoce: A Nose-Mounted Low-Audibility Speech Interface for Always-Available Speech Interaction	cs.AI	0
PC-Diffuser: Path-Consistent Capsule CBF Safety Filtering for Diffusion-Based Trajectory Planner	cs.AI	0
Does Reasoning Make Search More Fair? Comparing Fairness in Reasoning and Non-Reasoning Rerankers	cs.AI	0
Querying Everything Everywhere All at Once: Supervaluationism for the Agentic Lakehouse	cs.AI	0

🏢 Lab Blog Posts

OpenAI: Designing AI agents to resist prompt injection
OpenAI: From model to agent: Equipping the Responses API with a computer environment
OpenAI: Wayfair boosts catalog accuracy and support speed with OpenAI
OpenAI: Rakuten fixes issues twice as fast with Codex

AI Watchtower Briefing — 2026-03-11
