AI Watchtower

🔴 High Significance

Model Releases

🔴 On Data Engineering for Scaling LLM Terminal Capabilities — score 95 Sources: huggingface

Despite rapid recent progress in the terminal capabilities of large language models, the training data strategies behind state-of-the-art terminal agents remain largely undisclosed. We address this gap through a systematic study of data engineering practices for terminal agents, making two key contr

Developer Tools

🔴 Query-focused and Memory-aware Reranker for Long Context Processing — score 85 Sources: huggingface

Built upon the existing analysis of retrieval heads in large language models, we propose an alternative reranking framework that trains models to estimate passage-query relevance using the attention scores of selected heads. This approach provides a listwise solution that leverages holistic informat

🔴 Test-Time Training with KV Binding Is Secretly Linear Attention — score 75 Sources: huggingface

Test-time training (TTT) with KV binding as a sequence modeling layer is commonly interpreted as a form of online meta-learning that memorizes a key-value mapping at test time. However, our analysis reveals multiple phenomena that contradict this memorization-based interpretation. Motivated by these f

🟡 Notable

Developer Tools

🟡 PyVision-RL: Forging Open Agentic Vision Models via RL — score 65 Sources: huggingface

Reinforcement learning for agentic multimodal models often suffers from interaction collapse, where models learn to reduce tool usage and multi-turn reasoning, limiting the benefits of agentic behavior. We introduce PyVision-RL, a reinforcement learning framework for open-weight multimodal models th

🟡 From Perception to Action: An Interactive Benchmark for Vision Reasoning — score 55 Sources: huggingface

Understanding physical structure is essential for real-world applications such as embodied agents, interactive design, and long-horizon manipulation. Yet, prevailing Vision-Language Model (VLM) evaluations still center on structure-agnostic, single-turn setups (e.g., VQA), which fail to assess a

🟡 Multi-Vector Index Compression in Any Modality — score 45 Sources: huggingface

We study efficient multi-vector retrieval for late interaction in any modality. Late interaction has emerged as a dominant paradigm for information retrieval in text, images, visual documents, and videos, but its computation and storage costs grow linearly with document length, making it costly for

Other Signals

🟡 Disrupting malicious uses of AI | February 2026 — score 50 Sources: lab_blog/OpenAI

Our latest threat report examines how malicious actors combine AI models with websites and social platforms—and what it means for detection and defense.

🟢 Incremental

Model Releases

🟢 DREAM: Deep Research Evaluation with Agentic Metrics — score 10 Sources: huggingface

Deep Research Agents generate analyst-grade reports, yet evaluating them remains challenging due to the absence of a single ground truth and the multidimensional nature of research quality. Recent benchmarks propose distinct methodologies, yet they suffer from the Mirage of Synthesis, where strong s

Developer Tools

🟢 QuantVLA: Scale-Calibrated Post-Training Quantization for Vision-Language-Action Models — score 35 Sources: huggingface

Vision-language-action (VLA) models unify perception, language, and control for embodied agents but face significant challenges in practical deployment due to rapidly increasing compute and memory demands, especially as models scale to longer horizons and larger backbones. To address these bottlenec

🟢 LongCLI-Bench: A Preliminary Benchmark and Study for Long-horizon Agentic Programming in Command-Line Interfaces — score 25 Sources: huggingface

Recent advances in AI-assisted programming have empowered agents to execute complex workflows via command-line interfaces. However, existing benchmarks are limited by short task horizons, data contamination from GitHub scraping, and a lack of fine-grained evaluation metrics, fail to rigorously evalu

🟢 See and Fix the Flaws: Enabling VLMs and Diffusion Models to Comprehend Visual Artifacts via Agentic Data Synthesis — score 10 Sources: huggingface

Despite recent advances in diffusion models, AI-generated images still often contain visual artifacts that compromise realism. Although more thorough pre-training and bigger models might reduce artifacts, there is no assurance that they can be completely eliminated, which makes artifact mitigation a

📄 New Papers

Title	Category	Score
On Data Engineering for Scaling LLM Terminal Capabilities	model_release	106
Query-focused and Memory-aware Reranker for Long Context Processing	developer_tool	62
Test-Time Training with KV Binding Is Secretly Linear Attention	developer_tool	34
PyVision-RL: Forging Open Agentic Vision Models via RL	developer_tool	33
From Perception to Action: An Interactive Benchmark for Vision Reasoning	developer_tool	26
Adversarial Robustness of Deep Learning-Based Thyroid Nodule Segmentation in Ultrasound	cs.AI	0
Revisiting Text Ranking in Deep Research	cs.AI	0
A Knowledge-Driven Approach to Music Segmentation, Music Source Separation and Cinematic Audio Source Separation	cs.AI	0
Poisoned Acoustics	cs.AI	0
GradAlign: Gradient-Aligned Data Selection for LLM Reinforcement Learning	cs.AI	0
Beyond Refusal: Probing the Limits of Agentic Self-Correction for Semantic Sensitive Information	cs.AI	0
Training Generalizable Collaborative Agents via Strategic Risk Aversion	cs.AI	0
One Brain, Omni Modalities: Towards Unified Non-Invasive Brain Decoding with Large Language Models	cs.AI	0
LiLo-VLA: Compositional Long-Horizon Manipulation via Linked Object-Centric Policies	cs.AI	0
ARLArena: A Unified Framework for Stable Agentic Reinforcement Learning	cs.AI	0

🏢 Lab Blog Posts

OpenAI: Disrupting malicious uses of AI | February 2026

AI Watchtower Briefing — 2026-02-25
