🔴 High Significance
Model Releases
🔴 Kimi K2.5: Visual Agentic Intelligence - score 85
Sources: huggingface
We introduce Kimi K2.5, an open-source multimodal agentic model designed to advance general agentic intelligence. K2.5 emphasizes the joint optimization of text and vision so that the two modalities enhance each other. This includes a series of techniques such as joint text-vision pre-training, zero-vis
🔴 Vision-DeepResearch: Incentivizing DeepResearch Capability in Multimodal Large Language Models - score 75
Sources: huggingface
Multimodal large language models (MLLMs) have achieved remarkable success across a broad range of vision tasks. However, constrained by the capacity of their internal world knowledge, prior work has proposed augmenting MLLMs by "reasoning-then-tool-call" for visual and textual search engines to ob
Developer Tools
🔴 Green-VLA: Staged Vision-Language-Action Model for Generalist Robots - score 95
Sources: huggingface
We introduce Green-VLA, a staged Vision-Language-Action (VLA) framework for real-world deployment on the Green humanoid robot while maintaining generalization across diverse embodiments. Green-VLA follows a five-stage curriculum: (L0) foundational VLMs, (L1) multimodal grounding, (R0) multi-embodime
🟡 Notable
Model Releases
🟡 Vision-DeepResearch Benchmark: Rethinking Visual and Textual Search for Multimodal Large Language Models - score 65
Sources: huggingface
Multimodal Large Language Models (MLLMs) have advanced VQA and now support Vision-DeepResearch systems that use search engines for complex visual-textual fact-finding. However, evaluating these visual and textual search abilities is still difficult, and existing benchmarks have two major limitations
🟡 The Sora feed philosophy - score 50
Sources: lab_blog/OpenAI
Discover the Sora feed philosophy, built to spark creativity, foster connections, and keep experiences safe with personalized recommendations, parental controls, and strong guardrails.
Developer Tools
🟡 Closing the Loop: Universal Repository Representation with RPG-Encoder - score 55
Sources: huggingface
Current repository agents encounter a reasoning disconnect due to fragmented representations, as existing methods rely on isolated API documentation or dependency graphs that lack semantic depth. We consider repository comprehension and generation to be inverse processes within a unified cycle: gene
🟡 UniReason 1.0: A Unified Reasoning Framework for World Knowledge Aligned Image Generation and Editing - score 45
Sources: huggingface
Unified multimodal models often struggle with complex synthesis tasks that demand deep reasoning, and typically treat text-to-image generation and image editing as isolated capabilities rather than interconnected reasoning steps. To address this, we propose UniReason, a unified framework that harmon
🟢 Incremental
Model Releases
🟢 SWE-Universe: Scale Real-World Verifiable Environments to Millions - score 35
Sources: huggingface
We propose SWE-Universe, a scalable and efficient framework for automatically constructing real-world software engineering (SWE) verifiable environments from GitHub pull requests (PRs). To overcome the prevalent challenges of automatic building, such as low production yield, weak verifiers, and proh
Developer Tools
🟢 FS-Researcher: Test-Time Scaling for Long-Horizon Research Tasks with File-System-Based Agents - score 25
Sources: huggingface
Deep research is emerging as a representative long-horizon task for large language model (LLM) agents. However, long trajectories in deep research often exceed model context limits, compressing token budgets for both evidence collection and report writing, and preventing effective test-time scaling.
🟢 SPARKLING: Balancing Signal Preservation and Symmetry Breaking for Width-Progressive Learning - score 15
Sources: huggingface
Progressive Learning (PL) reduces pre-training computational overhead by gradually increasing model scale. While prior work has extensively explored depth expansion, width expansion remains significantly understudied, with the few existing methods limited to the early stages of training. However, ex
🟢 PixelGen: Pixel Diffusion Beats Latent Diffusion with Perceptual Loss - score 5
Sources: huggingface
Pixel diffusion generates images directly in pixel space in an end-to-end manner, avoiding the artifacts and bottlenecks introduced by VAEs in two-stage latent diffusion. However, it is challenging to optimize high-dimensional pixel manifolds that contain many perceptually irrelevant signals, leavin
New Papers
| Title | Category | Score | Link |
|---|---|---|---|
| Green-VLA: Staged Vision-Language-Action Model for Generalist Robots | developer_tool | 332 | Open |
| Kimi K2.5: Visual Agentic Intelligence | model_release | 273 | Open |
| Vision-DeepResearch: Incentivizing DeepResearch Capability in Multimodal Large Language Models | model_release | 160 | Open |
| Vision-DeepResearch Benchmark: Rethinking Visual and Textual Search for Multimodal Large Language Models | model_release | 121 | Open |
| Closing the Loop: Universal Repository Representation with RPG-Encoder | developer_tool | 87 | Open |
| RPG-AE: Neuro-Symbolic Graph Autoencoders with Rare Pattern Mining for Provenance-Based Anomaly Detection | cs.AI | 0 | Open |
| Equal Access, Unequal Interaction: A Counterfactual Audit of LLM Fairness | cs.AI | 0 | Open |
| Nüwa: Mending the Spatial Integrity Torn by VLM Token Pruning | cs.AI | 0 | Open |
| UAT-LITE: Inference-Time Uncertainty-Aware Attention for Pretrained Transformers | cs.AI | 0 | Open |
| Synthetic Data Augmentation for Medical Audio Classification: A Preliminary Evaluation | cs.AI | 0 | Open |
| Embodiment-Aware Generalist Specialist Distillation for Unified Humanoid Whole-Body Control | cs.AI | 0 | Open |
| Generative Engine Optimization: A VLM and Agent Framework for Pinterest Acquisition Growth | cs.AI | 0 | Open |
| DDL2PropBank Agent: Benchmarking Multi-Agent Frameworks' Developer Experience Through a Novel Relational Schema Mapping Task | cs.AI | 0 | Open |
| Where Norms and References Collide: Evaluating LLMs on Normative Reasoning | cs.AI | 0 | Open |
| Aligning Forest and Trees in Images and Long Captions for Visually Grounded Understanding | cs.AI | 0 | Open |
Lab Blog Posts
- OpenAI: The Sora feed philosophy