AI Watchtower

🔴 High Significance

Model Releases

🔴 Step 3.5 Flash: Open Frontier-Level Intelligence with 11B Active Parameters — score 95 Sources: huggingface

We introduce Step 3.5 Flash, a sparse Mixture-of-Experts (MoE) model that bridges frontier-level agentic intelligence and computational efficiency. We focus on what matters most when building agents: sharp reasoning and fast, reliable execution. Step 3.5 Flash pairs a 196B-parameter foundation with
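Editor's note: as a rough illustration of the sparsity these headline figures imply, the sketch below computes the share of parameters active per token. It assumes the 11B in the item title and the 196B in the teaser are the active and total counts of the same model. The function name `active_fraction` is hypothetical and does not come from the paper.

```python
def active_fraction(total_params: float, active_params: float) -> float:
    """Return the share of parameters active per token in a sparse MoE model."""
    if total_params <= 0 or not 0 <= active_params <= total_params:
        raise ValueError("expected 0 <= active_params <= total_params, total_params > 0")
    return active_params / total_params


TOTAL_PARAMS = 196e9   # "196B-parameter foundation" (teaser text)
ACTIVE_PARAMS = 11e9   # "11B Active Parameters" (item title); assumed to be the same model

share = active_fraction(TOTAL_PARAMS, ACTIVE_PARAMS)
print(f"Active share per token: {share:.1%}")
```

On those two figures, roughly 5.6% of the weights would be active for any given token. This is arithmetic on the listing, not a result reported by the authors.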

🔴 GENIUS: Generative Fluid Intelligence Evaluation Suite — score 75 Sources: huggingface

Unified Multimodal Models (UMMs) have shown remarkable progress in visual generation. Yet, existing benchmarks predominantly assess Crystallized Intelligence, which relies on recalling accumulated knowledge and learned schemas. This focus overlooks Generative Fluid Intelligence (GFI): the capacity t

Developer Tools

🔴 VidVec: Unlocking Video MLLM Embeddings for Video-Text Retrieval — score 85 Sources: huggingface

Recent studies have adapted generative Multimodal Large Language Models (MLLMs) into embedding extractors for vision tasks, typically through fine-tuning to produce universal representations. However, their performance on video remains inferior to Video Foundation Models (VFMs). In this paper, we fo

🟡 Notable

Model Releases

🟡 ASA: Training-Free Representation Engineering for Tool-Calling Agents — score 55 Sources: huggingface

Adapting LLM agents to domain-specific tool calling remains notably brittle under evolving interfaces. Prompt and schema engineering is easy to deploy but often fragile under distribution shift and strict parsers, while continual parameter-efficient fine-tuning improves reliability at the cost of tr

🟡 Introducing GPT-5.3-Codex-Spark — score 50 Sources: lab_blog/OpenAI

Introducing GPT-5.3-Codex-Spark—our first real-time coding model. 15x faster generation, 128k context, now in research preview for ChatGPT Pro users.

🟡 Gemini 3 Deep Think: Advancing science, research and engineering — score 50 Sources: lab_blog/DeepMind

Our most specialized reasoning mode is now updated to solve modern science, research and engineering challenges.

🟡 Towards Autonomous Mathematics Research — score 45 Sources: huggingface

Recent advances in foundational models have yielded reasoning systems capable of achieving a gold-medal standard at the International Mathematical Olympiad. The transition from competition-level problem-solving to professional research, however, requires navigating vast literature and constructing l

Developer Tools

🟡 PhyCritic: Multimodal Critic Models for Physical AI — score 65 Sources: huggingface

With the rapid development of large multimodal models, reliable judge and critic models have become essential for open-ended evaluation and preference alignment, providing pairwise preferences, numerical scores, and explanatory justifications for assessing model-generated responses. However, existin

🟢 Incremental

Model Releases

🟢 TimeChat-Captioner: Scripting Multi-Scene Videos with Time-Aware and Structural Audio-Visual Captions — score 10 Sources: huggingface

This paper proposes Omni Dense Captioning, a novel task designed to generate continuous, fine-grained, and structured audio-visual narratives with explicit timestamps. To ensure dense semantic coverage, we introduce a six-dimensional structural schema to create "script-like" captions, enabling reade

Developer Tools

🟢 When to Memorize and When to Stop: Gated Recurrent Memory for Long-Context Reasoning — score 35 Sources: huggingface

While reasoning over long context is crucial for various real-world applications, it remains challenging for large language models (LLMs) as they suffer from performance degradation as the context length grows. Recent work MemAgent has tried to tackle this by processing context chunk-by-chunk in an

🟢 How Do Decoder-Only LLMs Perceive Users? Rethinking Attention Masking for User Representation Learning — score 25 Sources: huggingface

Decoder-only large language models are increasingly used as behavioral encoders for user representation learning, yet the impact of attention masking on the quality of user embeddings remains underexplored. In this work, we conduct a systematic study of causal, hybrid, and bidirectional attention ma

🟢 G-LNS: Generative Large Neighborhood Search for LLM-Based Automatic Heuristic Design — score 10 Sources: huggingface

While Large Language Models (LLMs) have recently shown promise in Automated Heuristic Design (AHD), existing approaches typically formulate AHD around constructive priority rules or parameterized local search guidance, thereby restricting the search space to fixed heuristic forms. Such designs offer

📄 New Papers

Title	Category	Score
Step 3.5 Flash: Open Frontier-Level Intelligence with 11B Active Parameters	model_release	202
VidVec: Unlocking Video MLLM Embeddings for Video-Text Retrieval	developer_tool	126
GENIUS: Generative Fluid Intelligence Evaluation Suite	model_release	58
PhyCritic: Multimodal Critic Models for Physical AI	developer_tool	57
ASA: Training-Free Representation Engineering for Tool-Calling Agents	model_release	44
From Noise to Order: Learning to Rank via Denoising Diffusion	cs.AI	0
Credit Where It is Due: Cross-Modality Connectivity Drives Precise Reinforcement Learning for MLLM Reasoning	cs.AI	0
EM-Aware Physical Synthesis: Neural Inductor Modeling and Intelligent Placement & Routing for RF Circuits	cs.AI	0
Compiler-Guided Inference-Time Adaptation: Improving GPT-5 Programming Performance in Idris	cs.AI	0
Understanding Persuasive Interactions between Generative Social Agents and Humans: The Knowledge-based Persuasion Model (KPM)	cs.AI	0
IMAGAgent: Orchestrating Multi-Turn Image Editing via Constraint-Aware Planning and Reflection	cs.AI	0
RooflineBench: A Benchmarking Framework for On-Device LLMs via Roofline Analysis	cs.AI	0
Multimodal Fact-Level Attribution for Verifiable Reasoning	cs.AI	0
AgentLeak: A Full-Stack Benchmark for Privacy Leakage in Multi-Agent LLM Systems	cs.AI	0
Differentially Private and Communication Efficient Large Language Model Split Inference via Stochastic Quantization and Soft Prompt	cs.AI	0

🏢 Lab Blog Posts

OpenAI: Introducing GPT-5.3-Codex-Spark
DeepMind: Gemini 3 Deep Think: Advancing science, research and engineering

AI Watchtower Briefing — 2026-02-12
