🔴 High Significance

Model Releases

🔴 Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document Collections - score 85 | Sources: huggingface

Multimodal agents offer a promising path to automating complex document-intensive workflows. Yet, a critical question remains: do these agents demonstrate genuine strategic reasoning, or merely stochastic trial-and-error search? To address this, we introduce MADQA, a benchmark of 2,250 human-authore

Developer Tools

🔴 Spatial-TTT: Streaming Visual-based Spatial Intelligence with Test-Time Training - score 95 | Sources: huggingface

Humans perceive and understand real-world spaces through a stream of visual observations. Therefore, the ability to streamingly maintain and update spatial evidence from potentially unbounded video streams is essential for spatial intelligence. The core challenge is not simply longer context windows

🔴 IndexCache: Accelerating Sparse Attention via Cross-Layer Index Reuse - score 75 | Sources: huggingface

Long-context agentic workflows have emerged as a defining use case for large language models, making attention efficiency critical for both inference speed and serving cost. Sparse attention addresses this challenge effectively, and DeepSeek Sparse Attention (DSA) is a representative production-grad

🟡 Notable

Model Releases

🟡 Video-Based Reward Modeling for Computer-Use Agents - score 65 | Sources: huggingface

Computer-using agents (CUAs) are becoming increasingly capable; however, it remains difficult to scale evaluation of whether a trajectory truly fulfills a user instruction. In this work, we study reward modeling from execution video: a sequence of keyframes from an agent trajectory that is independe

Developer Tools

🟡 ShotVerse: Advancing Cinematic Camera Control for Text-Driven Multi-Shot Video Creation - score 55 | Sources: huggingface

Text-driven video generation has democratized film creation, but camera control in cinematic multi-shot scenarios remains a significant block. Implicit textual prompts lack precision, while explicit trajectory conditioning imposes prohibitive manual overhead and often triggers execution failures in

🟡 XSkill: Continual Learning from Experience and Skills in Multimodal Agents - score 45 | Sources: huggingface

Multimodal agents can now tackle complex reasoning tasks with diverse tools, yet they still suffer from inefficient tool use and inflexible orchestration in open-ended settings. A central challenge is enabling such agents to continually improve without parameter updates by learning from past traject

🟢 Incremental

Model Releases

🟢 DVD: Deterministic Video Depth Estimation with Generative Priors - score 25 | Sources: huggingface

Existing video depth estimation faces a fundamental trade-off: generative models suffer from stochastic geometric hallucinations and scale drift, while discriminative models demand massive labeled datasets to resolve semantic ambiguities. To break this impasse, we present DVD, the first framework to

🟢 Trust Your Critic: Robust Reward Modeling and Reinforcement Learning for Faithful Image Editing and Generation - score 10 | Sources: huggingface

Reinforcement learning (RL) has emerged as a promising paradigm for enhancing image editing and text-to-image (T2I) generation. However, current reward models, which act as critics during RL, often suffer from hallucinations and assign noisy scores, inherently misguiding the optimization process. In

Developer Tools

🟢 DreamVideo-Omni: Omni-Motion Controlled Multi-Subject Video Customization with Latent Identity Reinforcement Learning - score 35 | Sources: huggingface

While large-scale diffusion models have revolutionized video synthesis, achieving precise control over both multi-subject identity and multi-granularity motion remains a significant challenge. Recent attempts to bridge this gap often suffer from limited motion granularity, control ambiguity, and ide

🟢 WeEdit: A Dataset, Benchmark and Glyph-Guided Framework for Text-centric Image Editing - score 10 | Sources: huggingface

Instruction-based image editing aims to modify specific content within existing images according to user-provided instructions while preserving non-target regions. Beyond traditional object- and style-centric manipulation, text-centric image editing focuses on modifying, translating, or rearranging

📄 New Papers

| Title | Category | Score | Link |
| --- | --- | --- | --- |
| Spatial-TTT: Streaming Visual-based Spatial Intelligence with Test-Time Training | developer_tool | 95 | Open |
| Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document Collections | model_release | 69 | Open |
| IndexCache: Accelerating Sparse Attention via Cross-Layer Index Reuse | developer_tool | 57 | Open |
| Video-Based Reward Modeling for Computer-Use Agents | model_release | 46 | Open |
| ShotVerse: Advancing Cinematic Camera Control for Text-Driven Multi-Shot Video Creation | developer_tool | 36 | Open |
| TERMINATOR: Learning Optimal Exit Points for Early Stopping in Chain-of-Thought Reasoning | cs.AI | 0 | Open |
| Spatio-Semantic Expert Routing Architecture with Mixture-of-Experts for Referring Image Segmentation | cs.AI | 0 | Open |
| Embedded Quantum Machine Learning in Embedded Systems: Feasibility, Hybrid Architectures, and Quantum Co-Processors | cs.AI | 0 | Open |
| CALF: Communication-Aware Learning Framework for Distributed Reinforcement Learning | cs.AI | 0 | Open |
| Compile to Compress: Boosting Formal Theorem Provers by Compiler Outputs | cs.AI | 0 | Open |
| Concentrated siting of AI data centers drives regional power-system stress under rising global compute demand | cs.AI | 0 | Open |
| Reinforcement Learning for Diffusion LLMs with Entropy-Guided Step Selection and Stepwise Advantages | cs.AI | 0 | Open |
| Sell Me This Stock: Unsafe Recommendation Drift in LLM Agents | cs.AI | 0 | Open |
| Agent Privilege Separation in OpenClaw: A Structural Defense Against Prompt Injection | cs.AI | 0 | Open |
| Self-Flow-Matching assisted Full Waveform Inversion | cs.AI | 0 | Open |