AW · AI Watchtower

🔴 High Significance

Developer Tools

🔴 Lost in Stories: Consistency Bugs in Long Story Generation by LLMs — score 95 Sources: huggingface

What happens when a storyteller forgets its own story? Large Language Models (LLMs) can now generate narratives spanning tens of thousands of words, but they often fail to maintain consistency throughout. When generating long-form narratives, these models can contradict their own established facts,

🔴 Holi-Spatial: Evolving Video Streams into Holistic 3D Spatial Intelligence — score 85 Sources: huggingface

The pursuit of spatial intelligence fundamentally relies on access to large-scale, fine-grained 3D data. However, existing approaches predominantly construct spatial understanding benchmarks by generating question-answer (QA) pairs from a limited number of manually annotated datasets, rather than sy

🔴 LoGeR: Long-Context Geometric Reconstruction with Hybrid Memory — score 75 Sources: huggingface

Feedforward geometric foundation models achieve strong short-window reconstruction, yet scaling them to minutes-long videos is bottlenecked by quadratic attention complexity or limited effective memory in recurrent designs. We present LoGeR (Long-context Geometric Reconstruction), a novel architectu

🟡 Notable

Model Releases

🟡 How Far Can Unsupervised RLVR Scale LLM Training? — score 65 Sources: huggingface

Unsupervised reinforcement learning with verifiable rewards (URLVR) offers a pathway to scale LLM training beyond the supervision bottleneck by deriving rewards without ground truth labels. Recent works leverage model intrinsic signals, showing promising early gains, yet their potential and limitati

🟡 New ways to learn math and science in ChatGPT — score 50 Sources: lab_blog/OpenAI

ChatGPT introduces interactive visual explanations for math and science, helping students explore formulas, variables, and concepts in real time.

Developer Tools

🟡 Believe Your Model: Distribution-Guided Confidence Calibration — score 55 Sources: huggingface

Large Reasoning Models have demonstrated remarkable performance with the advancement of test-time scaling techniques, which enhance prediction accuracy by generating multiple candidate responses and selecting the most reliable answer. While prior work has analyzed that internal model signals like c

🟡 CoCo: Code as CoT for Text-to-Image Preview and Rare Concept Generation — score 45 Sources: huggingface

Recent advances in Unified Multimodal Models (UMMs) have significantly improved text-to-image (T2I) generation, particularly through the integration of Chain-of-Thought (CoT) reasoning. However, existing CoT-based T2I methods largely rely on abstract natural-language planning, which lacks the pr

Other Signals

🟡 Improving instruction hierarchy in frontier LLMs — score 50 Sources: lab_blog/OpenAI

IH-Challenge trains models to prioritize trusted instructions, improving instruction hierarchy, safety steerability, and resistance to prompt injection attacks.

🟢 Incremental

Developer Tools

🟢 CARE-Edit: Condition-Aware Routing of Experts for Contextual Image Editing — score 35 Sources: huggingface

Unified diffusion editors often rely on a fixed, shared backbone for diverse tasks, suffering from task interference and poor adaptation to heterogeneous demands (e.g., local vs global, semantic vs photometric). In particular, prevalent ControlNet and OmniControl variants combine multiple conditioni

🟢 HiAR: Efficient Autoregressive Long Video Generation via Hierarchical Denoising — score 25 Sources: huggingface

Autoregressive (AR) diffusion offers a promising framework for generating videos of theoretically infinite length. However, a major challenge is maintaining temporal continuity while preventing the progressive quality degradation caused by error accumulation. To ensure continuity, existing methods t

🟢 $OneMillion-Bench: How Far are Language Agents from Human Experts? — score 15 Sources: huggingface

As language models (LMs) evolve from chat assistants to long-horizon agents capable of multi-step reasoning and tool use, existing benchmarks remain largely confined to structured or exam-style tasks that fall short of real-world professional demands. To this end, we introduce $OneMillion-Bench OneM

🟢 NLE: Non-autoregressive LLM-based ASR by Transcript Editing — score 5 Sources: huggingface

While autoregressive (AR) LLM-based ASR systems achieve strong accuracy, their sequential decoding limits parallelism and incurs high latency. We propose NLE, a non-autoregressive (NAR) approach that formulates speech recognition as conditional transcript editing, enabling fully parallel prediction.

📄 New Papers

Title	Category	Score	Link
Lost in Stories: Consistency Bugs in Long Story Generation by LLMs	developer_tool	98	Open
Holi-Spatial: Evolving Video Streams into Holistic 3D Spatial Intelligence	developer_tool	91	Open
LoGeR: Long-Context Geometric Reconstruction with Hybrid Memory	developer_tool	69	Open
How Far Can Unsupervised RLVR Scale LLM Training?	model_release	63	Open
Believe Your Model: Distribution-Guided Confidence Calibration	developer_tool	44	Open
WS-Net: Weak-Signal Representation Learning and Gated Abundance Reconstruction for Hyperspectral Unmixing via State-Space and Weak Signal Attention Fusion	cs.AI	0	Open
The Epistemic Support-Point Filter: Jaynesian Maximum Entropy Meets Popperian Falsification	cs.AI	0	Open
Time, Identity and Consciousness in Language Model Agents	cs.AI	0	Open
Quantifying Gender Bias in Large Language Models: When ChatGPT Becomes a Hiring Manager	cs.AI	0	Open
EPOCH: An Agentic Protocol for Multi-Round System Optimization	cs.AI	0	Open
From Days to Minutes: An Autonomous AI Agent Achieves Reliable Clinical Triage in Remote Patient Monitoring	cs.AI	0	Open
Sim2Act: Robust Simulation-to-Decision Learning via Adversarial Calibration and Group-Relative Perturbation	cs.AI	0	Open
From Scalars to Tensors: Declared Losses Recover Epistemic Distinctions That Neutrosophic Scalars Cannot Express	cs.AI	0	Open
A Text-Native Interface for Generative Video Authoring	cs.AI	0	Open
GST-VLA: Structured Gaussian Spatial Tokens for 3D Depth-Aware Vision-Language-Action Models	cs.AI	0	Open

🏢 Lab Blog Posts

OpenAI: Improving instruction hierarchy in frontier LLMs
OpenAI: New ways to learn math and science in ChatGPT

AI Watchtower Briefing — 2026-03-10
