🔴 High Significance
Model Releases
🔴 Trace2Skill: Distill Trajectory-Local Lessons into Transferable Agent Skills - score 75
Sources: huggingface
Equipping Large Language Model (LLM) agents with domain-specific skills is critical for tackling complex tasks. Yet, manual authoring creates a severe scalability bottleneck. Conversely, automated skill generation often yields fragile or fragmented results because it either relies on shallow paramet
Developer Tools
🔴 ShotStream: Streaming Multi-Shot Video Generation for Interactive Storytelling - score 95
Sources: huggingface
Multi-shot video generation is crucial for long narrative storytelling, yet current bidirectional architectures suffer from limited interactivity and high latency. We propose ShotStream, a novel causal multi-shot architecture that enables interactive storytelling and efficient on-the-fly frame gener
🔴 Out of Sight but Not Out of Mind: Hybrid Memory for Dynamic Video World Models - score 85
Sources: huggingface
Video world models have shown immense potential in simulating the physical world, yet existing memory mechanisms primarily treat environments as static canvases. When dynamic subjects hide out of sight and later re-emerge, current methods often struggle, leading to frozen, distorted, or vanishing su
🟡 Notable
Model Releases
🟡 MedOpenClaw: Auditable Medical Imaging Agents Reasoning over Uncurated Full Studies - score 45
Sources: huggingface
Currently, evaluating vision-language models (VLMs) in medical imaging tasks oversimplifies clinical reality by relying on pre-selected 2D images that demand significant manual labor to curate. This setup misses the core challenge of real-world diagnostics: a true clinical agent must actively navigat
Developer Tools
🟡 PackForcing: Short Video Training Suffices for Long Video Sampling and Long Context Inference - score 65
Sources: huggingface
Autoregressive video diffusion models have demonstrated remarkable progress, yet they remain bottlenecked by intractable linear KV-cache growth, temporal repetition, and compounding errors during long-video generation. To address these challenges, we present PackForcing, a unified framework that eff
🟡 Sommelier: Scalable Open Multi-turn Audio Pre-processing for Full-duplex Speech Language Models - score 55
Sources: huggingface
As the paradigm of AI shifts from text-based LLMs to Speech Language Models (SLMs), there is a growing demand for full-duplex systems capable of real-time, natural human-computer interaction. However, the development of such models is constrained by the scarcity of high-quality, multi-speaker conver
🟢 Incremental
Model Releases
🟢 RealChart2Code: Advancing Chart-to-Code Generation with Real Data and Multi-Task Evaluation - score 35
Sources: huggingface
Vision-Language Models (VLMs) have demonstrated impressive capabilities in code generation across various domains. However, their ability to replicate complex, multi-panel visualizations from real-world data remains largely unassessed. To address this gap, we introduce RealChart2Code, a new
Developer Tools
🟢 Natural-Language Agent Harnesses - score 25
Sources: huggingface
Agent performance increasingly depends on harness engineering, yet harness design is usually buried in controller code and runtime-specific conventions, making it hard to transfer, compare, and study as a scientific object. We ask whether the high-level control logic of an agent harness can instead
🟢 Know3D: Prompting 3D Generation with Knowledge from Vision-Language Models - score 10
Sources: huggingface
Recent advances in 3D generation have improved the fidelity and geometric details of synthesized 3D assets. However, due to the inherent ambiguity of single-view observations and the lack of robust global structural priors caused by limited 3D training data, the unseen regions generated by existing
🟢 LongTail Driving Scenarios with Reasoning Traces: The KITScenes LongTail Dataset - score 10
Sources: huggingface
In real-world domains such as self-driving, generalization to rare scenarios remains a fundamental challenge. To address this, we introduce a new dataset designed for end-to-end driving that focuses on long-tail driving events. We provide multi-view video data, trajectories, high-level instructions,
New Papers
| Title | Category | Score | Link |
|---|---|---|---|
| ShotStream: Streaming Multi-Shot Video Generation for Interactive Storytelling | developer_tool | 161 | Open |
| Out of Sight but Not Out of Mind: Hybrid Memory for Dynamic Video World Models | developer_tool | 160 | Open |
| Trace2Skill: Distill Trajectory-Local Lessons into Transferable Agent Skills | model_release | 66 | Open |
| PackForcing: Short Video Training Suffices for Long Video Sampling and Long Context Inference | developer_tool | 56 | Open |
| Sommelier: Scalable Open Multi-turn Audio Pre-processing for Full-duplex Speech Language Models | developer_tool | 38 | Open |
| ITQ3_S: High-Fidelity 3-bit LLM Inference via Interleaved Ternary Quantization with Rotation-Domain Smoothing | cs.AI | 0 | Open |
| Adversarial Attacks on Multimodal Large Language Models: A Comprehensive Survey | cs.AI | 0 | Open |
| A Learning-Based Cooperative Coevolution Framework for Heterogeneous Large-Scale Global Optimization | cs.AI | 0 | Open |
| Beyond Message Passing: A Semantic View of Agent Communication Protocols | cs.AI | 0 | Open |
| GEAKG: Generative Executable Algorithm Knowledge Graphs | cs.AI | 0 | Open |
| Physics-Guided Transformer (PGT): Physics-Aware Attention Mechanism for PINNs | cs.AI | 0 | Open |
| JaWildText: A Benchmark for Vision-Language Models on Japanese Scene Text Understanding | cs.AI | 0 | Open |
| CSAttention: Centroid-Scoring Attention for Accelerating LLM Inference | cs.AI | 0 | Open |
| CARV: A Diagnostic Benchmark for Compositional Analogical Reasoning in Multimodal LLMs | cs.AI | 0 | Open |
| SARL: Label-Free Reinforcement Learning by Rewarding Reasoning Topology | cs.AI | 0 | Open |