🔴 High Significance
Model Releases
🔴 Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document Collections - score 85
Sources: huggingface
Multimodal agents offer a promising path to automating complex document-intensive workflows. Yet, a critical question remains: do these agents demonstrate genuine strategic reasoning, or merely stochastic trial-and-error search? To address this, we introduce MADQA, a benchmark of 2,250 human-authore
Developer Tools
🔴 Spatial-TTT: Streaming Visual-based Spatial Intelligence with Test-Time Training - score 95
Sources: huggingface
Humans perceive and understand real-world spaces through a stream of visual observations. Therefore, the ability to streamingly maintain and update spatial evidence from potentially unbounded video streams is essential for spatial intelligence. The core challenge is not simply longer context windows
🔴 IndexCache: Accelerating Sparse Attention via Cross-Layer Index Reuse - score 75
Sources: huggingface
Long-context agentic workflows have emerged as a defining use case for large language models, making attention efficiency critical for both inference speed and serving cost. Sparse attention addresses this challenge effectively, and DeepSeek Sparse Attention (DSA) is a representative production-grad
🟡 Notable
Model Releases
🟡 Video-Based Reward Modeling for Computer-Use Agents - score 65
Sources: huggingface
Computer-using agents (CUAs) are becoming increasingly capable; however, it remains difficult to scale evaluation of whether a trajectory truly fulfills a user instruction. In this work, we study reward modeling from execution video: a sequence of keyframes from an agent trajectory that is independe
Developer Tools
🟡 ShotVerse: Advancing Cinematic Camera Control for Text-Driven Multi-Shot Video Creation - score 55
Sources: huggingface
Text-driven video generation has democratized film creation, but camera control in cinematic multi-shot scenarios remains a significant bottleneck. Implicit textual prompts lack precision, while explicit trajectory conditioning imposes prohibitive manual overhead and often triggers execution failures in
🟡 XSkill: Continual Learning from Experience and Skills in Multimodal Agents - score 45
Sources: huggingface
Multimodal agents can now tackle complex reasoning tasks with diverse tools, yet they still suffer from inefficient tool use and inflexible orchestration in open-ended settings. A central challenge is enabling such agents to continually improve without parameter updates by learning from past traject
🟢 Incremental
Model Releases
🟢 DVD: Deterministic Video Depth Estimation with Generative Priors - score 25
Sources: huggingface
Existing video depth estimation faces a fundamental trade-off: generative models suffer from stochastic geometric hallucinations and scale drift, while discriminative models demand massive labeled datasets to resolve semantic ambiguities. To break this impasse, we present DVD, the first framework to
🟢 Trust Your Critic: Robust Reward Modeling and Reinforcement Learning for Faithful Image Editing and Generation - score 10
Sources: huggingface
Reinforcement learning (RL) has emerged as a promising paradigm for enhancing image editing and text-to-image (T2I) generation. However, current reward models, which act as critics during RL, often suffer from hallucinations and assign noisy scores, inherently misguiding the optimization process. In
Developer Tools
🟢 DreamVideo-Omni: Omni-Motion Controlled Multi-Subject Video Customization with Latent Identity Reinforcement Learning - score 35
Sources: huggingface
While large-scale diffusion models have revolutionized video synthesis, achieving precise control over both multi-subject identity and multi-granularity motion remains a significant challenge. Recent attempts to bridge this gap often suffer from limited motion granularity, control ambiguity, and ide
🟢 WeEdit: A Dataset, Benchmark and Glyph-Guided Framework for Text-centric Image Editing - score 10
Sources: huggingface
Instruction-based image editing aims to modify specific content within existing images according to user-provided instructions while preserving non-target regions. Beyond traditional object- and style-centric manipulation, text-centric image editing focuses on modifying, translating, or rearranging
New Papers
| Title | Category | Score | Link |
|---|---|---|---|
| Spatial-TTT: Streaming Visual-based Spatial Intelligence with Test-Time Training | developer_tool | 95 | Open |
| Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document Collections | model_release | 69 | Open |
| IndexCache: Accelerating Sparse Attention via Cross-Layer Index Reuse | developer_tool | 57 | Open |
| Video-Based Reward Modeling for Computer-Use Agents | model_release | 46 | Open |
| ShotVerse: Advancing Cinematic Camera Control for Text-Driven Multi-Shot Video Creation | developer_tool | 36 | Open |
| TERMINATOR: Learning Optimal Exit Points for Early Stopping in Chain-of-Thought Reasoning | cs.AI | 0 | Open |
| Spatio-Semantic Expert Routing Architecture with Mixture-of-Experts for Referring Image Segmentation | cs.AI | 0 | Open |
| Embedded Quantum Machine Learning in Embedded Systems: Feasibility, Hybrid Architectures, and Quantum Co-Processors | cs.AI | 0 | Open |
| CALF: Communication-Aware Learning Framework for Distributed Reinforcement Learning | cs.AI | 0 | Open |
| Compile to Compress: Boosting Formal Theorem Provers by Compiler Outputs | cs.AI | 0 | Open |
| Concentrated siting of AI data centers drives regional power-system stress under rising global compute demand | cs.AI | 0 | Open |
| Reinforcement Learning for Diffusion LLMs with Entropy-Guided Step Selection and Stepwise Advantages | cs.AI | 0 | Open |
| Sell Me This Stock: Unsafe Recommendation Drift in LLM Agents | cs.AI | 0 | Open |
| Agent Privilege Separation in OpenClaw: A Structural Defense Against Prompt Injection | cs.AI | 0 | Open |
| Self-Flow-Matching assisted Full Waveform Inversion | cs.AI | 0 | Open |