🔴 High Significance

Model Releases

🔴 Multimodal OCR: Parse Anything from Documents - score 85 | Sources: huggingface

We present Multimodal OCR (MOCR), a document parsing paradigm that jointly parses text and graphics into unified textual representations. Unlike conventional OCR systems that focus on text recognition and leave graphical regions as cropped pixels, our method, termed dots.mocr, treats visual elements

Developer Tools

🔴 LMEB: Long-horizon Memory Embedding Benchmark - score 95 | Sources: huggingface

Memory embeddings are crucial for memory-augmented systems, such as OpenClaw, but their evaluation is underexplored in current text embedding benchmarks, which narrowly focus on traditional passage retrieval and fail to assess models' ability to handle long-horizon memory retrieval tasks involving f

🔴 Can Vision-Language Models Solve the Shell Game? - score 75 | Sources: huggingface

Visual entity tracking is an innate cognitive ability in humans, yet it remains a critical bottleneck for Vision-Language Models (VLMs). This deficit is often obscured in existing video benchmarks by visual shortcuts. We introduce VET-Bench, a synthetic diagnostic testbed featuring visually identica

🟑 Notable

Model Releases

🟑 Cheers: Decoupling Patch Details from Semantic Representations Enables Unified Multimodal Comprehension and Generation β€” score 65 Sources: huggingface

A recent cutting-edge topic in multimodal modeling is to unify visual comprehension and generation within a single model. However, the two tasks demand mismatched decoding regimes and visual representations, making it non-trivial to jointly optimize within a shared feature space. In this work, we pr

🟑 OmniForcing: Unleashing Real-time Joint Audio-Visual Generation β€” score 55 Sources: huggingface

Recent joint audio-visual diffusion models achieve remarkable generation quality but suffer from high latency due to their bidirectional attention dependencies, hindering real-time applications. We propose OmniForcing, the first framework to distill an offline, dual-stream bidirectional diffusion mo

🟑 Video Streaming Thinking: VideoLLMs Can Watch and Think Simultaneously β€” score 40 Sources: huggingface

Online Video Large Language Models (VideoLLMs) play a critical role in supporting responsive, real-time interaction. Existing methods focus on streaming perception, lacking a synchronized logical reasoning stream. However, directly applying test-time scaling methods incurs unacceptable response late

🟑 daVinci-Env: Open SWE Environment Synthesis at Scale β€” score 40 Sources: huggingface

Training capable software engineering (SWE) agents demands large-scale, executable, and verifiable environments that provide dynamic feedback loops for iterative code editing, test execution, and solution refinement. However, existing open-source datasets remain limited in scale and repository diver

Developer Tools

🟑 Why Codex Security Doesn’t Include a SAST Report β€” score 50 Sources: lab_blog/OpenAI

A deep dive into why Codex Security doesn’t rely on traditional SAST, instead using AI-driven constraint reasoning and validation to find real vulnerabilities with fewer false positives.

🟒 Incremental

Model Releases

🟒 Visual-ERM: Reward Modeling for Visual Equivalence β€” score 25 Sources: huggingface

Vision-to-code tasks require models to reconstruct structured visual inputs, such as charts, tables, and SVGs, into executable or structured representations with high visual fidelity. While recent Large Vision Language Models (LVLMs) achieve strong results via supervised fine-tuning, reinforcement l

Developer Tools

🟒 MM-CondChain: A Programmatically Verified Benchmark for Visually Grounded Deep Compositional Reasoning β€” score 15 Sources: huggingface

Multimodal Large Language Models (MLLMs) are increasingly used to carry out visual workflows such as navigating GUIs, where the next step depends on verified visual compositional conditions (e.g., "if a permission dialog appears and the color of the interface is green, click Allow") and the process

🟒 Spend Less, Reason Better: Budget-Aware Value Tree Search for LLM Agents β€” score 5 Sources: huggingface

Test-time scaling has become a dominant paradigm for improving LLM agent reliability, yet current approaches treat compute as an abundant resource, allowing agents to exhaust token and tool budgets on redundant steps or dead-end trajectories. Existing budget-aware methods either require expensive fi

📄 New Papers

Title | Category | Score | Link
LMEB: Long-horizon Memory Embedding Benchmark | developer_tool | 79 | Open
Multimodal OCR: Parse Anything from Documents | model_release | 49 | Open
Can Vision-Language Models Solve the Shell Game? | developer_tool | 42 | Open
Cheers: Decoupling Patch Details from Semantic Representations Enables Unified Multimodal Comprehension and Generation | model_release | 41 | Open
OmniForcing: Unleashing Real-time Joint Audio-Visual Generation | model_release | 35 | Open
MVHOI: Bridge Multi-view Condition to Complex Human-Object Interaction Video Reenactment via 3D Foundation Model | cs.AI | 0 | Open
AgentTrace: Causal Graph Tracing for Root Cause Analysis in Deployed Multi-Agent Systems | cs.AI | 0 | Open
Applications of Intuitionistic Temporal Logic to Temporal Answer Set Programming | cs.AI | 0 | Open
Robust Building Damage Detection in Cross-Disaster Settings Using Domain Adaptation | cs.AI | 0 | Open
Beyond Local Code Optimization: Multi-Agent Reasoning for Software System Optimization | cs.AI | 0 | Open
AdapterTune: Zero-Initialized Low-Rank Adapters for Frozen Vision Transformers | cs.AI | 0 | Open
Transition Flow Matching | cs.AI | 0 | Open
GameUIAgent: An LLM-Powered Framework for Automated Game UI Design with Structured Intermediate Representation | cs.AI | 0 | Open
Loosely-Structured Software: Engineering Context, Structure, and Evolution Entropy in Runtime-Rewired Multi-Agent Systems | cs.AI | 0 | Open
Gauge-Equivariant Intrinsic Neural Operators for Geometry-Consistent Learning of Elliptic PDE Maps | cs.AI | 0 | Open

🏒 Lab Blog Posts