🔴 High Significance
Model Releases
🔴 SpecEyes: Accelerating Agentic Multimodal LLMs via Speculative Perception and Planning - score 75
Sources: huggingface
Agentic multimodal large language models (MLLMs) (e.g., OpenAI o3 and Gemini Agentic Vision) achieve remarkable reasoning capabilities through iterative visual tool invocation. However, the cascaded perception, reasoning, and tool-calling loops introduce significant sequential overhead. This overhea
Developer Tools
🔴 MinerU-Diffusion: Rethinking Document OCR as Inverse Rendering via Diffusion Decoding - score 95
Sources: huggingface
Optical character recognition (OCR) has evolved from line-level transcription to structured document parsing, requiring models to recover long-form sequences containing layout, tables, and formulas. Despite recent advances in vision-language models, most existing systems rely on autoregressive decod
🔴 WildWorld: A Large-Scale Dataset for Dynamic World Modeling with Actions and Explicit State toward Generative ARPG - score 85
Sources: huggingface
Dynamical systems theory and reinforcement learning view world evolution as latent-state dynamics driven by actions, with visual observations providing partial information about the state. Recent video world models attempt to learn this action-conditioned dynamics from data. However, existing datase
🟡 Notable
Model Releases
🟡 Introducing the OpenAI Safety Bug Bounty program - score 50
Sources: lab_blog/OpenAI
OpenAI launches a Safety Bug Bounty program to identify AI abuse and safety risks, including agentic vulnerabilities, prompt injection, and data exfiltration.
🟡 Lyria 3 Pro: Create longer tracks in more - score 50
Sources: lab_blog/DeepMind
Introducing Lyria 3 Pro, which unlocks longer tracks with structural awareness. We're also bringing Lyria to more Google products and surfaces.
Developer Tools
🟡 From Static Templates to Dynamic Runtime Graphs: A Survey of Workflow Optimization for LLM Agents - score 65
Sources: huggingface
Large language model (LLM)-based systems are becoming increasingly popular for solving tasks by constructing executable workflows that interleave LLM calls, information retrieval, tool use, code execution, memory updates, and verification. This survey reviews recent methods for designing and optimiz
🟡 DA-Flow: Degradation-Aware Optical Flow Estimation with Diffusion Models - score 55
Sources: huggingface
Optical flow models trained on high-quality data often degrade severely when confronted with real-world corruptions such as blur, noise, and compression artifacts. To overcome this limitation, we formulate Degradation-Aware Optical Flow, a new task targeting accurate dense correspondence estimation
🟡 Inside our approach to the Model Spec - score 50
Sources: lab_blog/OpenAI
Learn how OpenAI's Model Spec serves as a public framework for model behavior, balancing safety, user freedom, and accountability as AI systems advance.
🟡 PEARL: Personalized Streaming Video Understanding Model - score 45
Sources: huggingface
Human cognition of new concepts is inherently a streaming process: we continuously recognize new objects or identities and update our memories over time. However, current multimodal personalization methods are largely limited to static images or offline videos. This disconnects continuous visual inp
Other Signals
🟡 Protecting people from harmful manipulation - score 50
Sources: lab_blog/DeepMind
Google DeepMind researches AI's harmful manipulation risks across areas like finance and health, leading to new safety measures.
🟢 Incremental
Model Releases
🟢 SIMART: Decomposing Monolithic Meshes into Sim-ready Articulated Assets via MLLM - score 35
Sources: huggingface
High-quality articulated 3D assets are indispensable for embodied AI and physical simulation, yet 3D generation still focuses on static meshes, leaving a gap in "sim-ready" interactive objects. Most recent articulated object creation methods rely on multi-stage pipelines that accumulate errors acros
Developer Tools
🟢 RealMaster: Lifting Rendered Scenes into Photorealistic Video - score 25
Sources: huggingface
State-of-the-art video generation models produce remarkable photorealism, but they lack the precise control required to align generated content with specific scene requirements. Furthermore, without an underlying explicit geometry, these models cannot guarantee 3D consistency. Conversely, 3D engines
🟢 UniGRPO: Unified Policy Optimization for Reasoning-Driven Visual Generation - score 15
Sources: huggingface
Unified models capable of interleaved generation have emerged as a promising paradigm, with the community increasingly converging on autoregressive modeling for text and flow matching for image generation. To advance this direction, we propose a unified reinforcement learning framework tailored for
🟢 Rethinking Token-Level Policy Optimization for Multimodal Chain-of-Thought - score 5
Sources: huggingface
Multimodal Chain-of-Thought (CoT) reasoning requires large vision-language models to construct reasoning trajectories that interleave perceptual grounding with multi-step inference. However, existing Reinforcement Learning with Verifiable Rewards (RLVR) methods typically optimize reasoning at a coar
New Papers
| Title | Category | Score | Link |
|---|---|---|---|
| MinerU-Diffusion: Rethinking Document OCR as Inverse Rendering via Diffusion Decoding | developer_tool | 141 | Open |
| WildWorld: A Large-Scale Dataset for Dynamic World Modeling with Actions and Explicit State toward Generative ARPG | developer_tool | 95 | Open |
| SpecEyes: Accelerating Agentic Multimodal LLMs via Speculative Perception and Planning | model_release | 66 | Open |
| From Static Templates to Dynamic Runtime Graphs: A Survey of Workflow Optimization for LLM Agents | developer_tool | 59 | Open |
| DA-Flow: Degradation-Aware Optical Flow Estimation with Diffusion Models | developer_tool | 54 | Open |
| Object Search in Partially-Known Environments via LLM-informed Model-based Planning and Prompt Selection | cs.AI | 0 | Open |
| Deep Neural Regression Collapse | cs.AI | 0 | Open |
| Willful Disobedience: Automatically Detecting Failures in Agentic Traces | cs.AI | 0 | Open |
| TED: Training-Free Experience Distillation for Multimodal Reasoning | cs.AI | 0 | Open |
| Perturbation: A simple and efficient adversarial tracer for representation learning in language models | cs.AI | 0 | Open |
| Circuit Complexity of Hierarchical Knowledge Tracing and Implications for Log-Precision Transformers | cs.AI | 0 | Open |
| Limits of Imagery Reasoning in Frontier LLM Models | cs.AI | 0 | Open |
| Learning-guided Prioritized Planning for Lifelong Multi-Agent Path Finding in Warehouse Automation | cs.AI | 0 | Open |
| VehicleMemBench: An Executable Benchmark for Multi-User Long-Term Memory in In-Vehicle Agents | cs.AI | 0 | Open |
| PoliticsBench: Benchmarking Political Values in Large Language Models with Multi-Turn Roleplay | cs.AI | 0 | Open |