AW · AI Watchtower

🔴 High Significance

Model Releases

🔴 Less is Enough: Synthesizing Diverse Data in Feature Space of LLMs — score 95 Sources: huggingface

The diversity of post-training data is critical for effective downstream performance in large language models (LLMs). Many existing approaches to constructing post-training data quantify diversity using text-based metrics that capture linguistic variation, but such metrics provide only weak signals

🔴 MedXIAOHE: A Comprehensive Recipe for Building Medical MLLMs — score 75 Sources: huggingface

We present MedXIAOHE, a medical vision-language foundation model designed to advance general-purpose medical understanding and reasoning in real-world clinical applications. MedXIAOHE achieves state-of-the-art performance across diverse medical benchmarks and surpasses leading closed-source multimod

Developer Tools

🔴 SQuTR: A Robustness Benchmark for Spoken Query to Text Retrieval under Acoustic Noise — score 85 Sources: huggingface

Spoken query retrieval is an important interaction mode in modern information retrieval. However, existing evaluation datasets are often limited to simple queries under constrained noise conditions, making them inadequate for assessing the robustness of spoken query retrieval systems under complex a

🟡 Notable

Model Releases

🟡 Towards Universal Video MLLMs with Attribute-Structured and Quality-Verified Instructions — score 50 Sources: huggingface

Universal video understanding requires modeling fine-grained visual and audio information over time in diverse real-world scenarios. However, the performance of existing models is primarily constrained by video-instruction data that represents complex audiovisual content as single, incomplete descri

🟡 OneVision-Encoder: Codec-Aligned Sparsity as a Foundational Principle for Multimodal Intelligence — score 50 Sources: huggingface

Hypothesis. Artificial general intelligence is, at its core, a compression problem. Effective compression demands resonance: deep learning scales best when its architecture aligns with the fundamental structure of the data. These are the fundamental principles. Yet, modern vision architectures have

Developer Tools

🟡 Zooming without Zooming: Region-to-Image Distillation for Fine-Grained Multimodal Perception — score 65 Sources: huggingface

Multimodal Large Language Models (MLLMs) excel at broad visual understanding but still struggle with fine-grained perception, where decisive evidence is small and easily overwhelmed by global context. Recent "Thinking-with-Images" methods alleviate this by iteratively zooming in and out of regions of i

🟢 Incremental

Model Releases

🟢 GeoAgent: Learning to Geolocate Everywhere with Reinforced Geographic Characteristics — score 15 Sources: huggingface

This paper presents GeoAgent, a model capable of reasoning closely with humans and deriving fine-grained address conclusions. Previous RL-based methods have achieved breakthroughs in performance and interpretability, but concerns remain because of their reliance on AI-generated chain-of-thought

Developer Tools

🟢 CoPE-VideoLM: Codec Primitives For Efficient Video Language Models — score 35 Sources: huggingface

Video Language Models (VideoLMs) empower AI systems to understand temporal dynamics in videos. To fit within the maximum context window, current methods use keyframe sampling, which can miss both macro-level events and micro-level details because of its sparse temporal coverage. Furthermore, proce

🟢 SemanticMoments: Training-Free Motion Similarity via Third Moment Features — score 25 Sources: huggingface

Retrieving videos based on semantic motion is a fundamental yet unsolved problem. Existing video representation approaches rely too heavily on static appearance and scene context rather than motion dynamics, a bias inherited from their training data and objectives. Conversely, traditional motion-centri

🟢 What does RL improve for Visual Reasoning? A Frankenstein-Style Analysis — score 5 Sources: huggingface

Reinforcement learning (RL) with verifiable rewards has become a standard post-training stage for boosting visual reasoning in vision-language models, yet it remains unclear what capabilities RL actually improves compared with supervised fine-tuning as cold-start initialization (IN). End-to-end benc

📄 New Papers

Title	Category	Score
Less is Enough: Synthesizing Diverse Data in Feature Space of LLMs	model_release	249
SQuTR: A Robustness Benchmark for Spoken Query to Text Retrieval under Acoustic Noise	developer_tool	220
MedXIAOHE: A Comprehensive Recipe for Building Medical MLLMs	model_release	81
Zooming without Zooming: Region-to-Image Distillation for Fine-Grained Multimodal Perception	developer_tool	66
Towards Universal Video MLLMs with Attribute-Structured and Quality-Verified Instructions	model_release	56
OneVision-Encoder: Codec-Aligned Sparsity as a Foundational Principle for Multimodal Intelligence	model_release	56
Key Considerations for Domain Expert Involvement in LLM Design and Evaluation: An Ethnographic Study	cs.AI	0
High Precision Audience Expansion via Extreme Classification in a Two-Sided Marketplace	cs.AI	0
A Trajectory-Based Safety Audit of Clawdbot (OpenClaw)	cs.AI	0
Image-based Joint-level Detection for Inflammation in Rheumatoid Arthritis from Small and Imbalanced Data	cs.AI	0
InnoEval: On Research Idea Evaluation as a Knowledge-Grounded, Multi-Perspective Reasoning Problem	cs.AI	0
Competition for attention predicts good-to-bad tipping in AI	cs.AI	0
Differentially Private Retrieval-Augmented Generation	cs.AI	0
Adapting VACE for Real-Time Autoregressive Video Diffusion	cs.AI	0
Hello-Chat: Towards Realistic Social Audio Interactions	cs.AI	0

AI Watchtower Briefing — 2026-02-16
