🔴 High Significance

Developer Tools

🔴 CAR-bench: Evaluating the Consistency and Limit-Awareness of LLM Agents under Real-World Uncertainty - score 95. Sources: huggingface

Existing benchmarks for Large Language Model (LLM) agents focus on task completion under idealistic settings but overlook reliability in real-world, user-facing applications. In domains, such as in-car voice assistants, users often issue incomplete or ambiguous requests, creating intrinsic uncertain

🔴 DFlash: Block Diffusion for Flash Speculative Decoding - score 85. Sources: huggingface

Autoregressive large language models (LLMs) deliver strong performance but require inherently sequential decoding, leading to high inference latency and poor GPU utilization. Speculative decoding mitigates this bottleneck by using a fast draft model whose outputs are verified in parallel by the targ

🔴 Spider-Sense: Intrinsic Risk Sensing for Efficient Agent Defense with Hierarchical Adaptive Screening - score 75. Sources: huggingface

As large language models (LLMs) evolve into autonomous agents, their real-world applicability has expanded significantly, accompanied by new security challenges. Most existing agent defense mechanisms adopt a mandatory checking paradigm, in which security validation is forcibly triggered at predefin

🟡 Notable

Model Releases

🟡 MemSkill: Learning and Evolving Memory Skills for Self-Evolving Agents - score 65. Sources: huggingface

Most Large Language Model (LLM) agent memory systems rely on a small set of static, hand-designed operations for extracting memory. These fixed procedures hard-code human priors about what to store and how to revise memory, making them rigid under diverse interaction patterns and inefficient on long

Developer Tools

🟡 Length-Unbiased Sequence Policy Optimization: Revealing and Controlling Response Length Variation in RLVR - score 55. Sources: huggingface

Recent applications of Reinforcement Learning with Verifiable Rewards (RLVR) to Large Language Models (LLMs) and Vision-Language Models (VLMs) have demonstrated significant success in enhancing reasoning capabilities for complex tasks. During RLVR training, an increase in response length is often re

🟡 Context Forcing: Consistent Autoregressive Video Generation with Long Context - score 45. Sources: huggingface

Recent approaches to real-time long video generation typically employ streaming tuning strategies, attempting to train a long-context student using a short-context (memoryless) teacher. In these frameworks, the student performs long rollouts but receives supervision from a teacher limited to short 5

Other Signals

🟡 Making AI work for everyone, everywhere: our approach to localization - score 50. Sources: lab_blog/OpenAI

OpenAI shares its approach to AI localization, showing how globally shared frontier models can be adapted to local languages, laws, and cultures without compromising safety.

🟢 Incremental

Model Releases

🟢 Dr. Kernel: Reinforcement Learning Done Right for Triton Kernel Generations - score 25. Sources: huggingface

High-quality kernel is critical for scalable AI systems, and enabling LLMs to generate such code would advance AI development. However, training LLMs for this task requires sufficient data, a robust environment, and the process is often vulnerable to reward hacking and lazy optimization. In these ca

Developer Tools

🟢 Reinforced Attention Learning - score 35. Sources: huggingface

Post-training with Reinforcement Learning (RL) has substantially improved reasoning in Large Language Models (LLMs) via test-time scaling. However, extending this paradigm to Multimodal LLMs (MLLMs) through verbose rationales yields limited gains for perception and can even degrade performance. We

🟢 RISE-Video: Can Video Generators Decode Implicit World Rules? - score 10. Sources: huggingface

While generative video models have achieved remarkable visual fidelity, their capacity to internalize and reason over implicit world rules remains a critical yet under-explored frontier. To bridge this gap, we present RISE-Video, a pioneering reasoning-oriented benchmark for Text-Image-to-Video (TI2

🟢 Reinforcement World Model Learning for LLM-based Agents - score 10. Sources: huggingface

Large language models (LLMs) have achieved strong performance in language-centric tasks. However, in agentic settings, LLMs often struggle to anticipate action consequences and adapt to environment dynamics, highlighting the need for world-modeling capabilities in LLM-based agents. We propose Reinfo

📄 New Papers

Title | Category | Score | Link
CAR-bench: Evaluating the Consistency and Limit-Awareness of LLM Agents under Real-World Uncertainty | developer_tool | 91 | Open
DFlash: Block Diffusion for Flash Speculative Decoding | developer_tool | 75 | Open
Spider-Sense: Intrinsic Risk Sensing for Efficient Agent Defense with Hierarchical Adaptive Screening | developer_tool | 73 | Open
MemSkill: Learning and Evolving Memory Skills for Self-Evolving Agents | model_release | 67 | Open
Length-Unbiased Sequence Policy Optimization: Revealing and Controlling Response Length Variation in RLVR | developer_tool | 57 | Open
One Bias After Another: Mechanistic Reward Shaping and Persistent Biases in Language Reward Models | cs.AI | 0 | Open
Pro-ZD: A Transferable Graph Neural Network Approach for Proactive Zero-Day Threats Mitigation | cs.AI | 0 | Open
Do LLMs Act Like Rational Agents? Measuring Belief Coherence in Probabilistic Decision Making | cs.AI | 0 | Open
Toward generative machine learning for boosting ensembles of climate simulations | cs.AI | 0 | Open
Robust Pre-Training of Medical Vision-and-Language Models with Domain-Invariant Multi-Modal Masked Reconstruction | cs.AI | 0 | Open
LatentChem: From Textual CoT to Latent Thinking in Chemical Reasoning | cs.AI | 0 | Open
Accelerating Vision Transformers on Brain Processing Unit | cs.AI | 0 | Open
Attention's Gravitational Field: A Power-Law Interpretation of Positional Correlation | cs.AI | 0 | Open
CALM: Class-Conditional Sparse Attention Vectors for Large Audio-Language Models | cs.AI | 0 | Open
The Condensate Theorem: Transformers are O(n), Not $O(n^2)$ | cs.AI | 0 | Open

๐Ÿข Lab Blog Posts