🔴 High Significance
Model Releases
🔴 Multimodal OCR: Parse Anything from Documents - score 85
Sources: huggingface
We present Multimodal OCR (MOCR), a document parsing paradigm that jointly parses text and graphics into unified textual representations. Unlike conventional OCR systems that focus on text recognition and leave graphical regions as cropped pixels, our method, termed dots.mocr, treats visual elements
Developer Tools
🔴 LMEB: Long-horizon Memory Embedding Benchmark - score 95
Sources: huggingface
Memory embeddings are crucial for memory-augmented systems, such as OpenClaw, but their evaluation is underexplored in current text embedding benchmarks, which narrowly focus on traditional passage retrieval and fail to assess models' ability to handle long-horizon memory retrieval tasks involving f
🔴 Can Vision-Language Models Solve the Shell Game? - score 75
Sources: huggingface
Visual entity tracking is an innate cognitive ability in humans, yet it remains a critical bottleneck for Vision-Language Models (VLMs). This deficit is often obscured in existing video benchmarks by visual shortcuts. We introduce VET-Bench, a synthetic diagnostic testbed featuring visually identica
🟡 Notable
Model Releases
🟡 Cheers: Decoupling Patch Details from Semantic Representations Enables Unified Multimodal Comprehension and Generation - score 65
Sources: huggingface
A recent cutting-edge topic in multimodal modeling is to unify visual comprehension and generation within a single model. However, the two tasks demand mismatched decoding regimes and visual representations, making it non-trivial to jointly optimize within a shared feature space. In this work, we pr
🟡 OmniForcing: Unleashing Real-time Joint Audio-Visual Generation - score 55
Sources: huggingface
Recent joint audio-visual diffusion models achieve remarkable generation quality but suffer from high latency due to their bidirectional attention dependencies, hindering real-time applications. We propose OmniForcing, the first framework to distill an offline, dual-stream bidirectional diffusion mo
🟡 Video Streaming Thinking: VideoLLMs Can Watch and Think Simultaneously - score 40
Sources: huggingface
Online Video Large Language Models (VideoLLMs) play a critical role in supporting responsive, real-time interaction. Existing methods focus on streaming perception, lacking a synchronized logical reasoning stream. However, directly applying test-time scaling methods incurs unacceptable response late
🟡 daVinci-Env: Open SWE Environment Synthesis at Scale - score 40
Sources: huggingface
Training capable software engineering (SWE) agents demands large-scale, executable, and verifiable environments that provide dynamic feedback loops for iterative code editing, test execution, and solution refinement. However, existing open-source datasets remain limited in scale and repository diver
Developer Tools
🟡 Why Codex Security Doesn't Include a SAST Report - score 50
Sources: lab_blog/OpenAI
A deep dive into why Codex Security doesn't rely on traditional SAST, instead using AI-driven constraint reasoning and validation to find real vulnerabilities with fewer false positives.
🟢 Incremental
Model Releases
🟢 Visual-ERM: Reward Modeling for Visual Equivalence - score 25
Sources: huggingface
Vision-to-code tasks require models to reconstruct structured visual inputs, such as charts, tables, and SVGs, into executable or structured representations with high visual fidelity. While recent Large Vision Language Models (LVLMs) achieve strong results via supervised fine-tuning, reinforcement l
Developer Tools
🟢 MM-CondChain: A Programmatically Verified Benchmark for Visually Grounded Deep Compositional Reasoning - score 15
Sources: huggingface
Multimodal Large Language Models (MLLMs) are increasingly used to carry out visual workflows such as navigating GUIs, where the next step depends on verified visual compositional conditions (e.g., "if a permission dialog appears and the color of the interface is green, click Allow") and the process
🟢 Spend Less, Reason Better: Budget-Aware Value Tree Search for LLM Agents - score 5
Sources: huggingface
Test-time scaling has become a dominant paradigm for improving LLM agent reliability, yet current approaches treat compute as an abundant resource, allowing agents to exhaust token and tool budgets on redundant steps or dead-end trajectories. Existing budget-aware methods either require expensive fi
New Papers
| Title | Category | Score |
|---|---|---|
| LMEB: Long-horizon Memory Embedding Benchmark | developer_tool | 79 |
| Multimodal OCR: Parse Anything from Documents | model_release | 49 |
| Can Vision-Language Models Solve the Shell Game? | developer_tool | 42 |
| Cheers: Decoupling Patch Details from Semantic Representations Enables Unified Multimodal Comprehension and Generation | model_release | 41 |
| OmniForcing: Unleashing Real-time Joint Audio-Visual Generation | model_release | 35 |
| MVHOI: Bridge Multi-view Condition to Complex Human-Object Interaction Video Reenactment via 3D Foundation Model | cs.AI | 0 |
| AgentTrace: Causal Graph Tracing for Root Cause Analysis in Deployed Multi-Agent Systems | cs.AI | 0 |
| Applications of Intuitionistic Temporal Logic to Temporal Answer Set Programming | cs.AI | 0 |
| Robust Building Damage Detection in Cross-Disaster Settings Using Domain Adaptation | cs.AI | 0 |
| Beyond Local Code Optimization: Multi-Agent Reasoning for Software System Optimization | cs.AI | 0 |
| AdapterTune: Zero-Initialized Low-Rank Adapters for Frozen Vision Transformers | cs.AI | 0 |
| Transition Flow Matching | cs.AI | 0 |
| GameUIAgent: An LLM-Powered Framework for Automated Game UI Design with Structured Intermediate Representation | cs.AI | 0 |
| Loosely-Structured Software: Engineering Context, Structure, and Evolution Entropy in Runtime-Rewired Multi-Agent Systems | cs.AI | 0 |
| Gauge-Equivariant Intrinsic Neural Operators for Geometry-Consistent Learning of Elliptic PDE Maps | cs.AI | 0 |