AW · AI Watchtower

🔴 High Significance

Model Releases

🔴 SWE-rebench V2: Language-Agnostic SWE Task Collection at Scale — score 75 Sources: huggingface

Software engineering (SWE) agents are improving rapidly, with recent gains driven largely by reinforcement learning (RL). However, RL training is constrained by the scarcity of large-scale task collections with reproducible execution environments and reliable test suites. Although a growing number o

Developer Tools

🔴 OmniLottie: Generating Vector Animations via Parameterized Lottie Tokens — score 95 Sources: huggingface

OmniLottie is a versatile framework that generates high-quality vector animations from multi-modal instructions. For flexible control over motion and visual content, we focus on Lottie, a lightweight JSON format for representing both shapes and animation behaviors. However, the raw Lottie JSON fil

🔴 From Scale to Speed: Adaptive Test-Time Scaling for Image Editing — score 85 Sources: huggingface

Image Chain-of-Thought (Image-CoT) is a test-time scaling paradigm that improves image generation by extending inference time. Most Image-CoT methods focus on text-to-image (T2I) generation. Unlike T2I generation, image editing is goal-directed: the solution space is constrained by the source image

🟡 Notable

Model Releases

🟡 RubricBench: Aligning Model-Generated Rubrics with Human Standards — score 65 Sources: huggingface

As Large Language Model (LLM) alignment evolves from simple completions to complex, highly sophisticated generation, reward models are increasingly shifting toward rubric-guided evaluation to mitigate surface-level biases. However, the community lacks a unified benchmark to assess this evaluation pa

🟡 CHIMERA: Compact Synthetic Data for Generalizable LLM Reasoning — score 55 Sources: huggingface

Large Language Models (LLMs) have recently exhibited remarkable reasoning capabilities, largely enabled by post-training on high-quality reasoning data with supervised fine-tuning (SFT) and reinforcement learning (RL). However, reproducing and extending these capabilities in open and scalable sett

🟡 GPT-5.3 Instant System Card — score 50 Sources: lab_blog/OpenAI

🟡 GPT-5.3 Instant: Smoother, more useful everyday conversations — score 50 Sources: lab_blog/OpenAI

🟡 Gemini 3.1 Flash-Lite: Built for intelligence at scale — score 50 Sources: lab_blog/DeepMind

Gemini 3.1 Flash-Lite is our fastest and most cost-efficient Gemini 3 series model yet.

Developer Tools

🟡 OpenAutoNLU: Open Source AutoML Library for NLU — score 45 Sources: huggingface

OpenAutoNLU is an open-source automated machine learning library for natural language understanding (NLU) tasks, covering both text classification and named entity recognition (NER). Unlike existing solutions, we introduce data-aware training regime selection that requires no manual configuration fr

🟢 Incremental

Model Releases

🟢 MMR-Life: Piecing Together Real-life Scenes for Multimodal Multi-image Reasoning — score 35 Sources: huggingface

Recent progress in the reasoning capabilities of multimodal large language models (MLLMs) has empowered them to address more complex tasks such as scientific analysis and mathematical reasoning. Despite their promise, MLLMs' reasoning abilities across different scenarios in real life remain largely

Developer Tools

🟢 VGGT-Det: Mining VGGT Internal Priors for Sensor-Geometry-Free Multi-View Indoor 3D Object Detection — score 25 Sources: huggingface

Current multi-view indoor 3D object detectors rely on sensor geometry that is costly to obtain (i.e., precisely calibrated multi-view camera poses) to fuse multi-view information into a global scene representation, limiting deployment in real-world scenes. We target a more practical setting: Sensor-

🟢 CoVe: Training Interactive Tool-Use Agents via Constraint-Guided Verification — score 5 Sources: huggingface

Developing multi-turn interactive tool-use agents is challenging because real-world user needs are often complex and ambiguous, yet agents must execute deterministic actions to satisfy them. To address this gap, we introduce CoVe (Constraint-Verification), a post-training data synthesis framework de

Infrastructure & Compute

🟢 CMI-RewardBench: Evaluating Music Reward Models with Compositional Multimodal Instruction — score 15 Sources: huggingface

While music generation models have evolved to handle complex multimodal inputs mixing text, lyrics, and reference audio, evaluation mechanisms have lagged behind. In this paper, we bridge this critical gap by establishing a comprehensive ecosystem for music reward modeling under Compositional Multim

📄 New Papers

Title	Category	Score
OmniLottie: Generating Vector Animations via Parameterized Lottie Tokens	developer_tool	156
From Scale to Speed: Adaptive Test-Time Scaling for Image Editing	developer_tool	143
SWE-rebench V2: Language-Agnostic SWE Task Collection at Scale	model_release	91
RubricBench: Aligning Model-Generated Rubrics with Human Standards	model_release	67
CHIMERA: Compact Synthetic Data for Generalizable LLM Reasoning	model_release	59
PRISM: Pushing the Frontier of Deep Think via Process Reward Model-Guided Inference	cs.AI	0
What Capable Agents Must Know: Selection Theorems for Robust Decision-Making under Uncertainty	cs.AI	0
Form Follows Function: Recursive Stem Model	cs.AI	0
Revealing Positive and Negative Role Models to Help People Make Good Decisions	cs.AI	0
NeuroProlog: Multi-Task Fine-Tuning for Neurosymbolic Mathematical Reasoning via the Cocktail Effect	cs.AI	0
Learning Object-Centric Spatial Reasoning for Sequential Manipulation in Cluttered Environments	cs.AI	0
Human-Certified Module Repositories for the AI Age	cs.AI	0
LLM-MLFFN: Multi-Level Autonomous Driving Behavior Feature Fusion via Large Language Model	cs.AI	0
Bridging Diffusion Guidance and Anderson Acceleration via Hopfield Dynamics	cs.AI	0
A Neuropsychologically Grounded Evaluation of LLM Cognitive Abilities	cs.AI	0

🏢 Lab Blog Posts

OpenAI: GPT-5.3 Instant System Card
OpenAI: GPT-5.3 Instant: Smoother, more useful everyday conversations
DeepMind: Gemini 3.1 Flash-Lite: Built for intelligence at scale

AI Watchtower Briefing — 2026-03-03
