🔴 High Significance

Model Releases

🔴 SWE-rebench V2: Language-Agnostic SWE Task Collection at Scale - score 75. Sources: huggingface

Software engineering agents (SWE) are improving rapidly, with recent gains largely driven by reinforcement learning (RL). However, RL training is constrained by the scarcity of large-scale task collections with reproducible execution environments and reliable test suites. Although a growing number o

Developer Tools

🔴 OmniLottie: Generating Vector Animations via Parameterized Lottie Tokens - score 95. Sources: huggingface

OmniLottie is a versatile framework that generates high-quality vector animations from multi-modal instructions. For flexible motion and visual content control, we focus on Lottie, a lightweight JSON format for representing both shapes and animation behaviors. However, the raw Lottie JSON fil

🔴 From Scale to Speed: Adaptive Test-Time Scaling for Image Editing - score 85. Sources: huggingface

Image Chain-of-Thought (Image-CoT) is a test-time scaling paradigm that improves image generation by extending inference time. Most Image-CoT methods focus on text-to-image (T2I) generation. Unlike T2I generation, image editing is goal-directed: the solution space is constrained by the source image

🟡 Notable

Model Releases

🟡 RubricBench: Aligning Model-Generated Rubrics with Human Standards - score 65. Sources: huggingface

As Large Language Model (LLM) alignment evolves from simple completions to complex, highly sophisticated generation, reward models are increasingly shifting toward rubric-guided evaluation to mitigate surface-level biases. However, the community lacks a unified benchmark to assess this evaluation pa

🟡 CHIMERA: Compact Synthetic Data for Generalizable LLM Reasoning - score 55. Sources: huggingface

Large Language Models (LLMs) have recently exhibited remarkable reasoning capabilities, largely enabled by supervised fine-tuning (SFT)- and reinforcement learning (RL)-based post-training on high-quality reasoning data. However, reproducing and extending these capabilities in open and scalable sett

🟡 GPT-5.3 Instant System Card - score 50. Sources: lab_blog/OpenAI

🟡 GPT-5.3 Instant: Smoother, more useful everyday conversations - score 50. Sources: lab_blog/OpenAI

🟡 Gemini 3.1 Flash-Lite: Built for intelligence at scale - score 50. Sources: lab_blog/DeepMind

Gemini 3.1 Flash-Lite is our fastest and most cost-efficient Gemini 3 series model yet.

Developer Tools

🟡 OpenAutoNLU: Open Source AutoML Library for NLU - score 45. Sources: huggingface

OpenAutoNLU is an open-source automated machine learning library for natural language understanding (NLU) tasks, covering both text classification and named entity recognition (NER). Unlike existing solutions, we introduce data-aware training regime selection that requires no manual configuration fr

🟢 Incremental

Model Releases

🟢 MMR-Life: Piecing Together Real-life Scenes for Multimodal Multi-image Reasoning - score 35. Sources: huggingface

Recent progress in the reasoning capabilities of multimodal large language models (MLLMs) has empowered them to address more complex tasks such as scientific analysis and mathematical reasoning. Despite their promise, MLLMs' reasoning abilities across different scenarios in real life remain largely

Developer Tools

🟢 VGGT-Det: Mining VGGT Internal Priors for Sensor-Geometry-Free Multi-View Indoor 3D Object Detection - score 25. Sources: huggingface

Current multi-view indoor 3D object detectors rely on sensor geometry that is costly to obtain (i.e., precisely calibrated multi-view camera poses) to fuse multi-view information into a global scene representation, limiting deployment in real-world scenes. We target a more practical setting: Sensor-

🟢 CoVe: Training Interactive Tool-Use Agents via Constraint-Guided Verification - score 5. Sources: huggingface

Developing multi-turn interactive tool-use agents is challenging because real-world user needs are often complex and ambiguous, yet agents must execute deterministic actions to satisfy them. To address this gap, we introduce CoVe (Constraint-Verification), a post-training data synthesis framework de

Infrastructure & Compute

🟢 CMI-RewardBench: Evaluating Music Reward Models with Compositional Multimodal Instruction - score 15. Sources: huggingface

While music generation models have evolved to handle complex multimodal inputs mixing text, lyrics, and reference audio, evaluation mechanisms have lagged behind. In this paper, we bridge this critical gap by establishing a comprehensive ecosystem for music reward modeling under Compositional Multim

📄 New Papers

Title | Category | Score | Link
OmniLottie: Generating Vector Animations via Parameterized Lottie Tokens | developer_tool | 156 | Open
From Scale to Speed: Adaptive Test-Time Scaling for Image Editing | developer_tool | 143 | Open
SWE-rebench V2: Language-Agnostic SWE Task Collection at Scale | model_release | 91 | Open
RubricBench: Aligning Model-Generated Rubrics with Human Standards | model_release | 67 | Open
CHIMERA: Compact Synthetic Data for Generalizable LLM Reasoning | model_release | 59 | Open
PRISM: Pushing the Frontier of Deep Think via Process Reward Model-Guided Inference | cs.AI | 0 | Open
What Capable Agents Must Know: Selection Theorems for Robust Decision-Making under Uncertainty | cs.AI | 0 | Open
Form Follows Function: Recursive Stem Model | cs.AI | 0 | Open
Revealing Positive and Negative Role Models to Help People Make Good Decisions | cs.AI | 0 | Open
NeuroProlog: Multi-Task Fine-Tuning for Neurosymbolic Mathematical Reasoning via the Cocktail Effect | cs.AI | 0 | Open
Learning Object-Centric Spatial Reasoning for Sequential Manipulation in Cluttered Environments | cs.AI | 0 | Open
Human-Certified Module Repositories for the AI Age | cs.AI | 0 | Open
LLM-MLFFN: Multi-Level Autonomous Driving Behavior Feature Fusion via Large Language Model | cs.AI | 0 | Open
Bridging Diffusion Guidance and Anderson Acceleration via Hopfield Dynamics | cs.AI | 0 | Open
A Neuropsychologically Grounded Evaluation of LLM Cognitive Abilities | cs.AI | 0 | Open

๐Ÿข Lab Blog Posts