🔴 High Significance

Model Releases

🔴 Geometry-Guided Reinforcement Learning for Multi-view Consistent 3D Scene Editing - score 95 Sources: huggingface

Leveraging the priors of 2D diffusion models for 3D editing has emerged as a promising paradigm. However, maintaining multi-view consistency in edited results remains challenging, and the extreme scarcity of 3D-consistent editing paired data renders supervised fine-tuning (SFT), the most effective t

Developer Tools

🔴 Thinking to Recall: How Reasoning Unlocks Parametric Knowledge in LLMs - score 85 Sources: huggingface

While reasoning in LLMs plays a natural role in math, code generation, and multi-hop factual questions, its effect on simple, single-hop factual questions remains unclear. Such questions do not require step-by-step logical decomposition, making the utility of reasoning highly counterintuitive. Never

🔴 MM-Zero: Self-Evolving Multi-Model Vision Language Models From Zero Data - score 75 Sources: huggingface

Self-evolving has emerged as a key paradigm for improving foundational models such as Large Language Models (LLMs) and Vision Language Models (VLMs) with minimal human intervention. While recent approaches have demonstrated that LLM agents can self-evolve from scratch with little to no data, VLMs in

🟡 Notable

Model Releases

🟡 Designing AI agents to resist prompt injection - score 50 Sources: lab_blog/OpenAI

How ChatGPT defends against prompt injection and social engineering by constraining risky actions and protecting sensitive data in agent workflows.

🟡 Fish Audio S2 Technical Report - score 45 Sources: huggingface

We introduce Fish Audio S2, an open-sourced text-to-speech system featuring multi-speaker, multi-turn generation, and, most importantly, instruction-following control via natural-language descriptions. To scale training, we develop a multi-stage training recipe together with a staged data pipeline c

Developer Tools

🟡 Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion - score 65 Sources: huggingface

While recent multimodal large language models (MLLMs) have made impressive strides, they predominantly employ a conventional autoregressive architecture as their backbone, leaving significant room to explore effective and efficient alternatives in architectural design. Concurrently, recent studies h

🟡 InternVL-U: Democratizing Unified Multimodal Models for Understanding, Reasoning, Generation and Editing - score 55 Sources: huggingface

Unified multimodal models (UMMs) that integrate understanding, reasoning, generation, and editing face inherent trade-offs between maintaining strong semantic comprehension and acquiring powerful generation capabilities. In this report, we present InternVL-U, a lightweight 4B-parameter UMM that demo

🟡 From model to agent: Equipping the Responses API with a computer environment - score 50 Sources: lab_blog/OpenAI

How OpenAI built an agent runtime using the Responses API, shell tool, and hosted containers to run secure, scalable agents with files, tools, and state.

🟡 Rakuten fixes issues twice as fast with Codex - score 50 Sources: lab_blog/OpenAI

Business & Funding

🟡 Wayfair boosts catalog accuracy and support speed with OpenAI - score 50 Sources: lab_blog/OpenAI

Wayfair uses OpenAI models to improve ecommerce support and product catalog accuracy, automating ticket triage and enhancing millions of product attributes at scale.

🟢 Incremental

Model Releases

🟢 Stepping VLMs onto the Court: Benchmarking Spatial Intelligence in Sports - score 35 Sources: huggingface

Sports have long attracted broad attention as they push the limits of human physical and cognitive capabilities. Amid growing interest in spatial intelligence for vision-language models (VLMs), sports provide a natural testbed for understanding high-intensity human motion and dynamic object interact

🟢 Reading, Not Thinking: Understanding and Bridging the Modality Gap When Text Becomes Pixels in Multimodal LLMs - score 25 Sources: huggingface

Multimodal large language models (MLLMs) can process text presented as images, yet they often perform worse than when the same content is provided as textual tokens. We systematically diagnose this "modality gap" by evaluating seven MLLMs across seven benchmarks in five input modes, spanning both sy

🟢 Are Audio-Language Models Listening? Audio-Specialist Heads for Adaptive Audio Steering - score 5 Sources: huggingface

Multimodal large language models can exhibit text dominance, over-relying on linguistic priors instead of grounding predictions in non-text inputs. One example is large audio-language models (LALMs) where decisive audio evidence can be under-utilized even when it contains important information. To a

Developer Tools

🟢 MiniAppBench: Evaluating the Shift from Text to Interactive HTML Responses in LLM-Powered Assistants - score 15 Sources: huggingface

With the rapid advancement of Large Language Models (LLMs) in code generation, human-AI interaction is evolving from static text responses to dynamic, interactive HTML-based applications, which we term MiniApps. These applications require models to not only render visual interfaces but also construc

📄 New Papers

Title | Category | Score | Link
Geometry-Guided Reinforcement Learning for Multi-view Consistent 3D Scene Editing | model_release | 155 | Open
Thinking to Recall: How Reasoning Unlocks Parametric Knowledge in LLMs | developer_tool | 79 | Open
MM-Zero: Self-Evolving Multi-Model Vision Language Models From Zero Data | developer_tool | 56 | Open
Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion | developer_tool | 55 | Open
InternVL-U: Democratizing Unified Multimodal Models for Understanding, Reasoning, Generation and Editing | developer_tool | 53 | Open
Conversational AI-Enhanced Exploration System to Query Large-Scale Digitised Collections of Natural History Museums | cs.AI | 0 | Open
Quantum entanglement provides a competitive advantage in adversarial games | cs.AI | 0 | Open
Hybrid Self-evolving Structured Memory for GUI Agents | cs.AI | 0 | Open
Simulation-in-the-Reasoning (SiR): A Conceptual Framework for Empirically Grounded AI in Autonomous Transportation | cs.AI | 0 | Open
Is this Idea Novel? An Automated Benchmark for Judgment of Research Ideas | cs.AI | 0 | Open
The Chronicles of RiDiC: Generating Datasets with Controlled Popularity Distribution for Long-form Factuality Evaluation | cs.AI | 0 | Open
NasoVoce: A Nose-Mounted Low-Audibility Speech Interface for Always-Available Speech Interaction | cs.AI | 0 | Open
PC-Diffuser: Path-Consistent Capsule CBF Safety Filtering for Diffusion-Based Trajectory Planner | cs.AI | 0 | Open
Does Reasoning Make Search More Fair? Comparing Fairness in Reasoning and Non-Reasoning Rerankers | cs.AI | 0 | Open
Querying Everything Everywhere All at Once: Supervaluationism for the Agentic Lakehouse | cs.AI | 0 | Open

🏢 Lab Blog Posts