🔴 High Significance

Model Releases

🔴 Trace2Skill: Distill Trajectory-Local Lessons into Transferable Agent Skills | score 75 | Sources: huggingface

Equipping Large Language Model (LLM) agents with domain-specific skills is critical for tackling complex tasks. Yet manual authoring creates a severe scalability bottleneck. Conversely, automated skill generation often yields fragile or fragmented results because it either relies on shallow…

Developer Tools

🔴 ShotStream: Streaming Multi-Shot Video Generation for Interactive Storytelling | score 95 | Sources: huggingface

Multi-shot video generation is crucial for long narrative storytelling, yet current bidirectional architectures suffer from limited interactivity and high latency. We propose ShotStream, a novel causal multi-shot architecture that enables interactive storytelling and efficient on-the-fly frame generation…
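The latency contrast between bidirectional and causal designs can be made concrete with attention masks. The sketch below is a generic illustration and not ShotStream's actual architecture. It assumes a video is split into frames of `tokens_per_frame` tokens.

- Under a bidirectional mask, every token sees every other token, so no frame is final until the whole clip exists.
- Under a block-causal mask, a token sees only its own frame and earlier frames, so each frame can be emitted as soon as it is produced.

The function names are hypothetical.

```python
def bidirectional_mask(num_frames: int, tokens_per_frame: int) -> list[list[bool]]:
    """Every query token may attend to every key token, including future frames."""
    n = num_frames * tokens_per_frame
    return [[True] * n for _ in range(n)]


def block_causal_mask(num_frames: int, tokens_per_frame: int) -> list[list[bool]]:
    """mask[i][j] is True when query token i may attend to key token j.

    Tokens attend to all tokens in their own frame and in earlier frames,
    but never to tokens in later frames.
    """
    n = num_frames * tokens_per_frame
    frame_of = [t // tokens_per_frame for t in range(n)]
    return [[frame_of[j] <= frame_of[i] for j in range(n)] for i in range(n)]


def is_streamable(mask: list[list[bool]], tokens_per_frame: int) -> bool:
    """True if no token attends to a later frame, i.e. frame k can be
    finalized before frames k+1, k+2, ... have been generated."""
    for i, row in enumerate(mask):
        for j, allowed in enumerate(row):
            if allowed and j // tokens_per_frame > i // tokens_per_frame:
                return False
    return True


if __name__ == "__main__":
    # One token per frame: the block-causal mask is lower-triangular.
    for row in block_causal_mask(num_frames=3, tokens_per_frame=1):
        print(["x" if a else "." for a in row])
    print(is_streamable(block_causal_mask(3, 2), 2))
    print(is_streamable(bidirectional_mask(3, 2), 2))
```

In a real transformer, the boolean mask would typically be turned into additive negative-infinity biases before the softmax. The streaming property checked by `is_streamable` is what allows frame-by-frame generation, with keys and values of earlier frames reused from a cache.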

🔴 Out of Sight but Not Out of Mind: Hybrid Memory for Dynamic Video World Models | score 85 | Sources: huggingface

Video world models have shown immense potential in simulating the physical world, yet existing memory mechanisms primarily treat environments as static canvases. When dynamic subjects hide out of sight and later re-emerge, current methods often struggle, leading to frozen, distorted, or vanishing subjects…

🟡 Notable

Model Releases

🟡 MedOpenClaw: Auditable Medical Imaging Agents Reasoning over Uncurated Full Studies | score 45 | Sources: huggingface

Currently, evaluating vision-language models (VLMs) in medical imaging tasks oversimplifies clinical reality by relying on pre-selected 2D images that demand significant manual labor to curate. This setup misses the core challenge of real-world diagnostics: a true clinical agent must actively navigate…

Developer Tools

🟡 PackForcing: Short Video Training Suffices for Long Video Sampling and Long Context Inference | score 65 | Sources: huggingface

Autoregressive video diffusion models have demonstrated remarkable progress, yet they remain bottlenecked by intractable linear KV-cache growth, temporal repetition, and compounding errors during long-video generation. To address these challenges, we present PackForcing, a unified framework that…
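The "linear KV-cache growth" above follows from how autoregressive attention caches past context. Every generated token leaves one key vector and one value vector per layer and per KV head.

The arithmetic below is a back-of-the-envelope sketch, not PackForcing's method. The model dimensions and the tokens-per-frame figure are hypothetical placeholders, chosen only to show the trend.

```python
def kv_cache_bytes(num_tokens: int, num_layers: int, num_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Memory needed to cache keys and values for `num_tokens` past tokens.

    The factor 2 counts one key vector plus one value vector per head,
    per layer, per token. bytes_per_elem=2 assumes fp16/bf16 storage.
    """
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_elem


# Hypothetical model and video settings (placeholders, not from the paper).
NUM_LAYERS = 30
NUM_KV_HEADS = 12
HEAD_DIM = 128
TOKENS_PER_FRAME = 1560  # latent tokens per generated frame (assumed)

if __name__ == "__main__":
    for frames in (10, 100, 1000):
        tokens = frames * TOKENS_PER_FRAME
        gib = kv_cache_bytes(tokens, NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM) / 2**30
        print(f"{frames:>5} frames -> {gib:8.1f} GiB of KV cache")
```

The cost is proportional to the number of tokens already generated. Under these assumptions, a video ten times longer therefore needs a cache ten times larger. This is why naive long-video generation exhausts memory unless the cached context is bounded in some way.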

🟡 Sommelier: Scalable Open Multi-turn Audio Pre-processing for Full-duplex Speech Language Models | score 55 | Sources: huggingface

As the paradigm of AI shifts from text-based LLMs to Speech Language Models (SLMs), there is a growing demand for full-duplex systems capable of real-time, natural human-computer interaction. However, the development of such models is constrained by the scarcity of high-quality, multi-speaker…

🟢 Incremental

Model Releases

🟢 RealChart2Code: Advancing Chart-to-Code Generation with Real Data and Multi-Task Evaluation | score 35 | Sources: huggingface

Vision-Language Models (VLMs) have demonstrated impressive capabilities in code generation across various domains. However, their ability to replicate complex, multi-panel visualizations from real-world data remains largely unassessed. To address this gap, we introduce RealChart2Code, a new…

Developer Tools

🟢 Natural-Language Agent Harnesses | score 25 | Sources: huggingface

Agent performance increasingly depends on harness engineering, yet harness design is usually buried in controller code and runtime-specific conventions, making it hard to transfer, compare, and study as a scientific object. We ask whether the high-level control logic of an agent harness can instead…

🟢 Know3D: Prompting 3D Generation with Knowledge from Vision-Language Models | score 10 | Sources: huggingface

Recent advances in 3D generation have improved the fidelity and geometric details of synthesized 3D assets. However, due to the inherent ambiguity of single-view observations and the lack of robust global structural priors caused by limited 3D training data, the unseen regions generated by existing…

🟢 LongTail Driving Scenarios with Reasoning Traces: The KITScenes LongTail Dataset | score 10 | Sources: huggingface

In real-world domains such as self-driving, generalization to rare scenarios remains a fundamental challenge. To address this, we introduce a new dataset designed for end-to-end driving that focuses on long-tail driving events. We provide multi-view video data, trajectories, high-level instructions, …

📄 New Papers

Title | Category | Score
ShotStream: Streaming Multi-Shot Video Generation for Interactive Storytelling | developer_tool | 161
Out of Sight but Not Out of Mind: Hybrid Memory for Dynamic Video World Models | developer_tool | 160
Trace2Skill: Distill Trajectory-Local Lessons into Transferable Agent Skills | model_release | 66
PackForcing: Short Video Training Suffices for Long Video Sampling and Long Context Inference | developer_tool | 56
Sommelier: Scalable Open Multi-turn Audio Pre-processing for Full-duplex Speech Language Models | developer_tool | 38
ITQ3_S: High-Fidelity 3-bit LLM Inference via Interleaved Ternary Quantization with Rotation-Domain Smoothing | cs.AI | 0
Adversarial Attacks on Multimodal Large Language Models: A Comprehensive Survey | cs.AI | 0
A Learning-Based Cooperative Coevolution Framework for Heterogeneous Large-Scale Global Optimization | cs.AI | 0
Beyond Message Passing: A Semantic View of Agent Communication Protocols | cs.AI | 0
GEAKG: Generative Executable Algorithm Knowledge Graphs | cs.AI | 0
Physics-Guided Transformer (PGT): Physics-Aware Attention Mechanism for PINNs | cs.AI | 0
JaWildText: A Benchmark for Vision-Language Models on Japanese Scene Text Understanding | cs.AI | 0
CSAttention: Centroid-Scoring Attention for Accelerating LLM Inference | cs.AI | 0
CARV: A Diagnostic Benchmark for Compositional Analogical Reasoning in Multimodal LLMs | cs.AI | 0
SARL: Label-Free Reinforcement Learning by Rewarding Reasoning Topology | cs.AI | 0