🔴 High Significance
Developer Tools
🔴 SkyReels-V4: Multi-modal Video-Audio Generation, Inpainting and Editing model · score 95
Sources: huggingface
SkyReels-V4 is a unified multimodal video foundation model for joint video-audio generation, inpainting, and editing. The model adopts a dual-stream Multimodal Diffusion Transformer (MMDiT) architecture, in which one branch synthesizes video and the other generates temporally aligned audio, while sharing…
🔴 MolHIT: Advancing Molecular-Graph Generation with Hierarchical Discrete Diffusion Models · score 85
Sources: huggingface
Molecular generation with diffusion models has emerged as a promising direction for AI-driven drug discovery and materials science. While graph diffusion models have been widely adopted due to the discrete nature of 2D molecular graphs, existing models suffer from low chemical validity and struggle…
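To make "discrete" graph diffusion concrete, the sketch below shows the forward (noising) step of absorbing-state, or masking, discrete diffusion applied to a molecular graph's atom and bond labels. It is a generic illustration under assumed conventions (a schedule where `t` is directly the masking probability, a `[MASK]` absorbing token, bonds stored as a symmetric label matrix). It does not reproduce MolHIT's hierarchical scheme, and `corrupt_graph` is a hypothetical helper.

```python
import random

MASK = "[MASK]"  # absorbing state (assumption: masking-style discrete diffusion)

def corrupt_graph(atoms, bonds, t, rng=random):
    """Forward process q(x_t | x_0) of absorbing-state discrete diffusion.

    Each atom label and each off-diagonal bond label is independently kept
    with probability 1 - t and replaced by MASK with probability t.
    The input lists are not modified; the diagonal of `bonds` is copied as-is.
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must be in [0, 1]")
    # rng.random() is uniform on [0, 1), so P(value >= t) = 1 - t.
    noisy_atoms = [a if rng.random() >= t else MASK for a in atoms]
    n = len(atoms)
    noisy_bonds = [row[:] for row in bonds]
    for i in range(n):
        for j in range(i + 1, n):
            label = bonds[i][j] if rng.random() >= t else MASK
            noisy_bonds[i][j] = noisy_bonds[j][i] = label  # keep the matrix symmetric
    return noisy_atoms, noisy_bonds
```

A denoising network would then be trained to recover the original labels at masked positions. Checking the chemical validity of the decoded molecule is out of scope for this sketch.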
Infrastructure & Compute
🔴 HyTRec: A Hybrid Temporal-Aware Attention Architecture for Long Behavior Sequential Recommendation · score 75
Sources: huggingface
Modeling long sequences of user behaviors has emerged as a critical frontier in generative recommendation. However, existing solutions face a dilemma: linear attention mechanisms achieve efficiency at the cost of retrieval precision due to limited state capacity, while softmax attention suffers from…
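The dilemma described above can be seen in a toy, pure-Python comparison. Softmax attention re-reads every stored key and value for each query, so its cost grows with history length, whereas recurrent linear attention folds the whole history into a fixed d_k × d_v matrix, which is the "limited state capacity" the abstract refers to. This is a generic sketch, not HyTRec's hybrid architecture; the elu(x) + 1 feature map is one common choice from the linear-attention literature and is an assumption here.

```python
import math

def softmax_attention(q, keys, values):
    """Softmax attention for one query over every stored key/value pair.
    Per-query cost grows linearly with the number of stored keys."""
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dv = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(dv)]

def phi(x):
    """elu(x) + 1: a positive feature map often used for linear attention (assumption)."""
    return [xi + 1.0 if xi > 0 else math.exp(xi) for xi in x]

class LinearAttentionState:
    """Recurrent linear attention: history is compressed into a fixed-size
    d_k x d_v matrix S and a length-d_k normalizer z, whatever the sequence length."""

    def __init__(self, d_k, d_v):
        self.S = [[0.0] * d_v for _ in range(d_k)]
        self.z = [0.0] * d_k

    def update(self, k, v):
        # Accumulate phi(k) v^T into S and phi(k) into z.
        fk = phi(k)
        for i in range(len(fk)):
            self.z[i] += fk[i]
            for j in range(len(v)):
                self.S[i][j] += fk[i] * v[j]

    def read(self, q):
        # Output = phi(q)^T S / (phi(q)^T z)
        fq = phi(q)
        denom = sum(a * b for a, b in zip(fq, self.z))
        return [sum(fq[i] * self.S[i][j] for i in range(len(fq))) / denom
                for j in range(len(self.S[0]))]
```

However many tokens pass through `update`, the state keeps the same shape, so older behaviors can only be recalled approximately; softmax attention keeps them exactly, at a cost that grows with the sequence.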
🟡 Notable
Model Releases
🟡 DualPath: Breaking the Storage Bandwidth Bottleneck in Agentic LLM Inference · score 65
Sources: huggingface
The performance of multi-turn, agentic LLM inference is increasingly dominated by KV-Cache storage I/O rather than computation. In prevalent disaggregated architectures, loading the massive KV-Cache from external storage creates a fundamental imbalance: storage NICs on prefill engines become bandwidth…
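A back-of-envelope calculation shows why KV-Cache loading can dominate. For a standard decoder-only transformer the cache holds one key and one value vector per layer, per KV head, per token, so its size is 2 × layers × kv_heads × head_dim × tokens × bytes_per_element. The model configuration and NIC bandwidth below are hypothetical round numbers, not figures from the DualPath paper.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens, bytes_per_elem=2):
    """Size of a transformer KV cache in bytes (the factor 2 counts keys and values)."""
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_elem

def load_time_seconds(num_bytes, bandwidth_bytes_per_s):
    """Time to move num_bytes over a link with the given sustained bandwidth."""
    return num_bytes / bandwidth_bytes_per_s

# Hypothetical GQA model: 32 layers, 8 KV heads of dim 128, fp16 (2 bytes).
per_token = kv_cache_bytes(32, 8, 128, num_tokens=1)        # 131072 B = 128 KiB
context = kv_cache_bytes(32, 8, 128, num_tokens=32 * 1024)  # 4 GiB for a 32k-token prefix
# Hypothetical 200 Gb/s storage NIC, i.e. 25e9 bytes/s sustained.
seconds = load_time_seconds(context, 25e9)                  # about 0.17 s per reload
print(per_token, context / 2**30, round(seconds, 3))
```

Under these assumed numbers, every turn that reloads a 32k-token prefix from storage moves about 4 GiB, which is the kind of traffic the abstract describes as saturating storage NICs on prefill engines.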
🟡 DreamID-Omni: Unified Framework for Controllable Human-Centric Audio-Video Generation · score 55
Sources: huggingface
Recent advancements in foundation models have revolutionized joint audio-video generation. However, existing approaches typically treat human-centric tasks, including reference-based audio-video generation (R2AV), video editing (RV2AV) and audio-driven video animation (RA2V), as isolated objectives…
🟡 OpenAI Codex and Figma launch seamless code-to-design experience · score 50
Sources: lab_blog/OpenAI
OpenAI and Figma launch a new Codex integration that connects code and design, enabling teams to move between implementation and the Figma canvas to iterate and ship faster.
Developer Tools
🟡 Pacific Northwest National Laboratory and OpenAI partner to accelerate federal permitting · score 50
Sources: lab_blog/OpenAI
OpenAI and Pacific Northwest National Laboratory introduce DraftNEPABench, a new benchmark evaluating how AI coding agents can accelerate federal permitting, showing potential to reduce NEPA drafting time by up to 15% and modernize infrastructure reviews.
🟡 Solaris: Building a Multiplayer Video World Model in Minecraft · score 45
Sources: huggingface
Existing action-conditioned video generation models (video world models) are limited to single-agent perspectives, failing to capture the multi-agent interactions of real-world environments. We introduce Solaris, a multiplayer video world model that simulates consistent multi-view observations…
Enterprise Adoption
🟡 Nano Banana 2: Combining Pro capabilities with lightning-fast speed · score 50
Sources: lab_blog/DeepMind
Our latest image generation model offers advanced world knowledge, production-ready specs, subject consistency and more, all at Flash speed.
🟢 Incremental
Model Releases
🟢 GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL · score 15
Sources: huggingface
Open-source native GUI agents still lag behind closed-source systems on long-horizon navigation tasks. This gap stems from two limitations: a shortage of high-quality, action-aligned reasoning data, and the direct adoption of generic post-training pipelines that overlook the unique challenges of GUI…
Developer Tools
🟢 ARLArena: A Unified Framework for Stable Agentic Reinforcement Learning · score 35
Sources: huggingface
Agentic reinforcement learning (ARL) has rapidly gained attention as a promising paradigm for training agents to solve complex, multi-step interactive tasks. Despite encouraging early results, ARL remains highly unstable, often leading to training collapse. This instability limits scalability to…
🟢 Image Generation with a Sphere Encoder · score 25
Sources: huggingface
We introduce the Sphere Encoder, an efficient generative framework capable of producing images in a single forward pass and competing with many-step diffusion models using fewer than five steps. Our approach works by learning an encoder that maps natural images uniformly onto a spherical latent space…
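Two geometric primitives behind a spherical latent space fit in a few lines: projecting a latent vector onto a sphere by L2 normalization, and drawing a uniform point on that sphere by normalizing an isotropic Gaussian sample (a standard fact, since the Gaussian density depends only on the vector's norm). The learned encoder and decoder are not shown, the unit radius is an assumption, and the helper names are hypothetical; this is not the Sphere Encoder paper's code.

```python
import math
import random

def project_to_sphere(z, radius=1.0):
    """L2-normalize a latent vector onto a sphere of the given radius."""
    norm = math.sqrt(sum(x * x for x in z))
    if norm == 0.0:
        raise ValueError("cannot project the zero vector")
    return [radius * x / norm for x in z]

def sample_uniform_sphere(dim, radius=1.0, rng=random):
    """Uniform sample on the sphere: draw an isotropic Gaussian vector and normalize it.
    Rotational symmetry of the Gaussian makes the direction uniformly distributed."""
    while True:
        g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        if any(g):  # redraw in the (practically impossible) all-zero case
            return project_to_sphere(g, radius)
```

A generator in this style would pass points from `sample_uniform_sphere` to a trained decoder; how the paper trains its encoder to cover the sphere uniformly is not reproduced here.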
🟢 World Guidance: World Modeling in Condition Space for Action Generation · score 5
Sources: huggingface
Leveraging future observation modeling to facilitate action generation presents a promising avenue for enhancing the capabilities of Vision-Language-Action (VLA) models. However, existing approaches struggle to strike a balance between maintaining efficient, predictable future representations and…
New Papers
| Title | Category | Score | Link |
|---|---|---|---|
| SkyReels-V4: Multi-modal Video-Audio Generation, Inpainting and Editing model | developer_tool | 63 | Open |
| MolHIT: Advancing Molecular-Graph Generation with Hierarchical Discrete Diffusion Models | developer_tool | 60 | Open |
| HyTRec: A Hybrid Temporal-Aware Attention Architecture for Long Behavior Sequential Recommendation | infrastructure | 59 | Open |
| DualPath: Breaking the Storage Bandwidth Bottleneck in Agentic LLM Inference | model_release | 58 | Open |
| DreamID-Omni: Unified Framework for Controllable Human-Centric Audio-Video Generation | model_release | 43 | Open |
| From Shallow Bayesian Neural Networks to Gaussian Processes: General Convergence, Identifiability and Scalable Inference | cs.AI | 0 | Open |
| A Survey on Quantitative Modeling of Trust in Online Social Networks | cs.AI | 0 | Open |
| Reinforcement-aware Knowledge Distillation for LLM Reasoning | cs.AI | 0 | Open |
| Mapping the Landscape of Artificial Intelligence in Life Cycle Assessment Using Large Language Models | cs.AI | 0 | Open |
| Mirroring the Mind: Distilling Human-Like Metacognitive Strategies into Large Language Models | cs.AI | 0 | Open |
| SignVLA: A Gloss-Free Vision-Language-Action Framework for Real-Time Sign Language-Guided Robotic Manipulation | cs.AI | 0 | Open |
| A Mathematical Theory of Agency and Intelligence | cs.AI | 0 | Open |
| Efficient Dialect-Aware Modeling and Conditioning for Low-Resource Taiwanese Hakka Speech Processing | cs.AI | 0 | Open |
| Cognitive Models and AI Algorithms Provide Templates for Designing Language Agents | cs.AI | 0 | Open |
| Iterative Prompt Refinement for Dyslexia-Friendly Text Summarization Using GPT-4o | cs.AI | 0 | Open |