🔴 High Significance

Model Releases

🔴 Kimi K2.5: Visual Agentic Intelligence - score 85 | Sources: huggingface

We introduce Kimi K2.5, an open-source multimodal agentic model designed to advance general agentic intelligence. K2.5 emphasizes the joint optimization of text and vision so that two modalities enhance each other. This includes a series of techniques such as joint text-vision pre-training, zero-vis

🔴 Vision-DeepResearch: Incentivizing DeepResearch Capability in Multimodal Large Language Models - score 75 | Sources: huggingface

Multimodal large language models (MLLMs) have achieved remarkable success across a broad range of vision tasks. However, constrained by the capacity of their internal world knowledge, prior work has proposed augmenting MLLMs by "reasoning-then-tool-call" for visual and textual search engines to ob

Developer Tools

🔴 Green-VLA: Staged Vision-Language-Action Model for Generalist Robots - score 95 | Sources: huggingface

We introduce Green-VLA, a staged Vision-Language-Action (VLA) framework for real-world deployment on the Green humanoid robot while maintaining generalization across diverse embodiments. Green-VLA follows a five stage curriculum: (L0) foundational VLMs, (L1) multimodal grounding, (R0) multi-embodime

🟡 Notable

Model Releases

🟡 Vision-DeepResearch Benchmark: Rethinking Visual and Textual Search for Multimodal Large Language Models - score 65 | Sources: huggingface

Multimodal Large Language Models (MLLMs) have advanced VQA and now support Vision-DeepResearch systems that use search engines for complex visual-textual fact-finding. However, evaluating these visual and textual search abilities is still difficult, and existing benchmarks have two major limitations

🟡 The Sora feed philosophy - score 50 | Sources: lab_blog/OpenAI

Discover the Sora feed philosophy, built to spark creativity, foster connections, and keep experiences safe with personalized recommendations, parental controls, and strong guardrails.

Developer Tools

🟡 Closing the Loop: Universal Repository Representation with RPG-Encoder - score 55 | Sources: huggingface

Current repository agents encounter a reasoning disconnect due to fragmented representations, as existing methods rely on isolated API documentation or dependency graphs that lack semantic depth. We consider repository comprehension and generation to be inverse processes within a unified cycle: gene

🟡 UniReason 1.0: A Unified Reasoning Framework for World Knowledge Aligned Image Generation and Editing - score 45 | Sources: huggingface

Unified multimodal models often struggle with complex synthesis tasks that demand deep reasoning, and typically treat text-to-image generation and image editing as isolated capabilities rather than interconnected reasoning steps. To address this, we propose UniReason, a unified framework that harmon

🟢 Incremental

Model Releases

🟢 SWE-Universe: Scale Real-World Verifiable Environments to Millions - score 35 | Sources: huggingface

We propose SWE-Universe, a scalable and efficient framework for automatically constructing real-world software engineering (SWE) verifiable environments from GitHub pull requests (PRs). To overcome the prevalent challenges of automatic building, such as low production yield, weak verifiers, and proh

Developer Tools

🟢 FS-Researcher: Test-Time Scaling for Long-Horizon Research Tasks with File-System-Based Agents - score 25 | Sources: huggingface

Deep research is emerging as a representative long-horizon task for large language model (LLM) agents. However, long trajectories in deep research often exceed model context limits, compressing token budgets for both evidence collection and report writing, and preventing effective test-time scaling.

🟢 SPARKLING: Balancing Signal Preservation and Symmetry Breaking for Width-Progressive Learning - score 15 | Sources: huggingface

Progressive Learning (PL) reduces pre-training computational overhead by gradually increasing model scale. While prior work has extensively explored depth expansion, width expansion remains significantly understudied, with the few existing methods limited to the early stages of training. However, ex

🟢 PixelGen: Pixel Diffusion Beats Latent Diffusion with Perceptual Loss - score 5 | Sources: huggingface

Pixel diffusion generates images directly in pixel space in an end-to-end manner, avoiding the artifacts and bottlenecks introduced by VAEs in two-stage latent diffusion. However, it is challenging to optimize high-dimensional pixel manifolds that contain many perceptually irrelevant signals, leavin

📄 New Papers

| Title | Category | Score | Link |
|---|---|---|---|
| Green-VLA: Staged Vision-Language-Action Model for Generalist Robots | developer_tool | 332 | Open |
| Kimi K2.5: Visual Agentic Intelligence | model_release | 273 | Open |
| Vision-DeepResearch: Incentivizing DeepResearch Capability in Multimodal Large Language Models | model_release | 160 | Open |
| Vision-DeepResearch Benchmark: Rethinking Visual and Textual Search for Multimodal Large Language Models | model_release | 121 | Open |
| Closing the Loop: Universal Repository Representation with RPG-Encoder | developer_tool | 87 | Open |
| RPG-AE: Neuro-Symbolic Graph Autoencoders with Rare Pattern Mining for Provenance-Based Anomaly Detection | cs.AI | 0 | Open |
| Equal Access, Unequal Interaction: A Counterfactual Audit of LLM Fairness | cs.AI | 0 | Open |
| Nüwa: Mending the Spatial Integrity Torn by VLM Token Pruning | cs.AI | 0 | Open |
| UAT-LITE: Inference-Time Uncertainty-Aware Attention for Pretrained Transformers | cs.AI | 0 | Open |
| Synthetic Data Augmentation for Medical Audio Classification: A Preliminary Evaluation | cs.AI | 0 | Open |
| Embodiment-Aware Generalist Specialist Distillation for Unified Humanoid Whole-Body Control | cs.AI | 0 | Open |
| Generative Engine Optimization: A VLM and Agent Framework for Pinterest Acquisition Growth | cs.AI | 0 | Open |
| DDL2PropBank Agent: Benchmarking Multi-Agent Frameworks' Developer Experience Through a Novel Relational Schema Mapping Task | cs.AI | 0 | Open |
| Where Norms and References Collide: Evaluating LLMs on Normative Reasoning | cs.AI | 0 | Open |
| Aligning Forest and Trees in Images and Long Captions for Visually Grounded Understanding | cs.AI | 0 | Open |

🏢 Lab Blog Posts