🔴 High Significance

Model Releases

🔴 SpecEyes: Accelerating Agentic Multimodal LLMs via Speculative Perception and Planning - score 75 Sources: huggingface

Agentic multimodal large language models (MLLMs) (e.g., OpenAI o3 and Gemini Agentic Vision) achieve remarkable reasoning capabilities through iterative visual tool invocation. However, the cascaded perception, reasoning, and tool-calling loops introduce significant sequential overhead. This overhea

Developer Tools

🔴 MinerU-Diffusion: Rethinking Document OCR as Inverse Rendering via Diffusion Decoding - score 95 Sources: huggingface

Optical character recognition (OCR) has evolved from line-level transcription to structured document parsing, requiring models to recover long-form sequences containing layout, tables, and formulas. Despite recent advances in vision-language models, most existing systems rely on autoregressive decod

🔴 WildWorld: A Large-Scale Dataset for Dynamic World Modeling with Actions and Explicit State toward Generative ARPG - score 85 Sources: huggingface

Dynamical systems theory and reinforcement learning view world evolution as latent-state dynamics driven by actions, with visual observations providing partial information about the state. Recent video world models attempt to learn this action-conditioned dynamics from data. However, existing datase

🟑 Notable

Model Releases

🟑 Introducing the OpenAI Safety Bug Bounty program β€” score 50 Sources: lab_blog/OpenAI

OpenAI launches a Safety Bug Bounty program to identify AI abuse and safety risks, including agentic vulnerabilities, prompt injection, and data exfiltration.

🟑 Lyria 3 Pro: Create longer tracks in more β€” score 50 Sources: lab_blog/DeepMind

Introducing Lyria 3 Pro, which unlocks longer tracks with structural awareness. We're also bringing Lyria to more Google products and surfaces.

Developer Tools

🟑 From Static Templates to Dynamic Runtime Graphs: A Survey of Workflow Optimization for LLM Agents β€” score 65 Sources: huggingface

Large language model (LLM)-based systems are becoming increasingly popular for solving tasks by constructing executable workflows that interleave LLM calls, information retrieval, tool use, code execution, memory updates, and verification. This survey reviews recent methods for designing and optimiz

🟑 DA-Flow: Degradation-Aware Optical Flow Estimation with Diffusion Models β€” score 55 Sources: huggingface

Optical flow models trained on high-quality data often degrade severely when confronted with real-world corruptions such as blur, noise, and compression artifacts. To overcome this limitation, we formulate Degradation-Aware Optical Flow, a new task targeting accurate dense correspondence estimation

🟑 Inside our approach to the Model Spec β€” score 50 Sources: lab_blog/OpenAI

Learn how OpenAI's Model Spec serves as a public framework for model behavior, balancing safety, user freedom, and accountability as AI systems advance.

🟑 PEARL: Personalized Streaming Video Understanding Model β€” score 45 Sources: huggingface

Human cognition of new concepts is inherently a streaming process: we continuously recognize new objects or identities and update our memories over time. However, current multimodal personalization methods are largely limited to static images or offline videos. This disconnects continuous visual inp

Other Signals

🟑 Protecting people from harmful manipulation β€” score 50 Sources: lab_blog/DeepMind

Google DeepMind studies the risk of harmful manipulation by AI in areas such as finance and health, and reports new safety measures resulting from that work.

🟒 Incremental

Model Releases

🟒 SIMART: Decomposing Monolithic Meshes into Sim-ready Articulated Assets via MLLM β€” score 35 Sources: huggingface

High-quality articulated 3D assets are indispensable for embodied AI and physical simulation, yet 3D generation still focuses on static meshes, leaving a gap in "sim-ready" interactive objects. Most recent articulated object creation methods rely on multi-stage pipelines that accumulate errors acros

Developer Tools

🟒 RealMaster: Lifting Rendered Scenes into Photorealistic Video β€” score 25 Sources: huggingface

State-of-the-art video generation models produce remarkable photorealism, but they lack the precise control required to align generated content with specific scene requirements. Furthermore, without an underlying explicit geometry, these models cannot guarantee 3D consistency. Conversely, 3D engines

🟒 UniGRPO: Unified Policy Optimization for Reasoning-Driven Visual Generation β€” score 15 Sources: huggingface

Unified models capable of interleaved generation have emerged as a promising paradigm, with the community increasingly converging on autoregressive modeling for text and flow matching for image generation. To advance this direction, we propose a unified reinforcement learning framework tailored for

🟒 Rethinking Token-Level Policy Optimization for Multimodal Chain-of-Thought β€” score 5 Sources: huggingface

Multimodal Chain-of-Thought (CoT) reasoning requires large vision-language models to construct reasoning trajectories that interleave perceptual grounding with multi-step inference. However, existing Reinforcement Learning with Verifiable Rewards (RLVR) methods typically optimize reasoning at a coar

📄 New Papers

| Title | Category | Score | Link |
| --- | --- | --- | --- |
| MinerU-Diffusion: Rethinking Document OCR as Inverse Rendering via Diffusion Decoding | developer_tool | 141 | Open |
| WildWorld: A Large-Scale Dataset for Dynamic World Modeling with Actions and Explicit State toward Generative ARPG | developer_tool | 95 | Open |
| SpecEyes: Accelerating Agentic Multimodal LLMs via Speculative Perception and Planning | model_release | 66 | Open |
| From Static Templates to Dynamic Runtime Graphs: A Survey of Workflow Optimization for LLM Agents | developer_tool | 59 | Open |
| DA-Flow: Degradation-Aware Optical Flow Estimation with Diffusion Models | developer_tool | 54 | Open |
| Object Search in Partially-Known Environments via LLM-informed Model-based Planning and Prompt Selection | cs.AI | 0 | Open |
| Deep Neural Regression Collapse | cs.AI | 0 | Open |
| Willful Disobedience: Automatically Detecting Failures in Agentic Traces | cs.AI | 0 | Open |
| TED: Training-Free Experience Distillation for Multimodal Reasoning | cs.AI | 0 | Open |
| Perturbation: A simple and efficient adversarial tracer for representation learning in language models | cs.AI | 0 | Open |
| Circuit Complexity of Hierarchical Knowledge Tracing and Implications for Log-Precision Transformers | cs.AI | 0 | Open |
| Limits of Imagery Reasoning in Frontier LLM Models | cs.AI | 0 | Open |
| Learning-guided Prioritized Planning for Lifelong Multi-Agent Path Finding in Warehouse Automation | cs.AI | 0 | Open |
| VehicleMemBench: An Executable Benchmark for Multi-User Long-Term Memory in In-Vehicle Agents | cs.AI | 0 | Open |
| PoliticsBench: Benchmarking Political Values in Large Language Models with Multi-Turn Roleplay | cs.AI | 0 | Open |

🏒 Lab Blog Posts