🔴 High Significance
Model Releases
🔴 Geometry-Guided Reinforcement Learning for Multi-view Consistent 3D Scene Editing - score 95
Sources: huggingface
Leveraging the priors of 2D diffusion models for 3D editing has emerged as a promising paradigm. However, maintaining multi-view consistency in edited results remains challenging, and the extreme scarcity of 3D-consistent editing paired data renders supervised fine-tuning (SFT), the most effective t
Developer Tools
🔴 Thinking to Recall: How Reasoning Unlocks Parametric Knowledge in LLMs - score 85
Sources: huggingface
While reasoning in LLMs plays a natural role in math, code generation, and multi-hop factual questions, its effect on simple, single-hop factual questions remains unclear. Such questions do not require step-by-step logical decomposition, making the utility of reasoning highly counterintuitive. Never
🔴 MM-Zero: Self-Evolving Multi-Model Vision Language Models From Zero Data - score 75
Sources: huggingface
Self-evolving has emerged as a key paradigm for improving foundational models such as Large Language Models (LLMs) and Vision Language Models (VLMs) with minimal human intervention. While recent approaches have demonstrated that LLM agents can self-evolve from scratch with little to no data, VLMs in
🟡 Notable
Model Releases
🟡 Designing AI agents to resist prompt injection - score 50
Sources: lab_blog/OpenAI
How ChatGPT defends against prompt injection and social engineering by constraining risky actions and protecting sensitive data in agent workflows.
🟡 Fish Audio S2 Technical Report - score 45
Sources: huggingface
We introduce Fish Audio S2, an open-sourced text-to-speech system featuring multi-speaker, multi-turn generation, and, most importantly, instruction-following control via natural-language descriptions. To scale training, we develop a multi-stage training recipe together with a staged data pipeline c
Developer Tools
🟡 Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion - score 65
Sources: huggingface
While recent multimodal large language models (MLLMs) have made impressive strides, they predominantly employ a conventional autoregressive architecture as their backbone, leaving significant room to explore effective and efficient alternatives in architectural design. Concurrently, recent studies h
🟡 InternVL-U: Democratizing Unified Multimodal Models for Understanding, Reasoning, Generation and Editing - score 55
Sources: huggingface
Unified multimodal models (UMMs) that integrate understanding, reasoning, generation, and editing face inherent trade-offs between maintaining strong semantic comprehension and acquiring powerful generation capabilities. In this report, we present InternVL-U, a lightweight 4B-parameter UMM that demo
🟡 From model to agent: Equipping the Responses API with a computer environment - score 50
Sources: lab_blog/OpenAI
How OpenAI built an agent runtime using the Responses API, shell tool, and hosted containers to run secure, scalable agents with files, tools, and state.
🟡 Rakuten fixes issues twice as fast with Codex - score 50
Sources: lab_blog/OpenAI
Business & Funding
🟡 Wayfair boosts catalog accuracy and support speed with OpenAI - score 50
Sources: lab_blog/OpenAI
Wayfair uses OpenAI models to improve ecommerce support and product catalog accuracy, automating ticket triage and enhancing millions of product attributes at scale.
🟢 Incremental
Model Releases
🟢 Stepping VLMs onto the Court: Benchmarking Spatial Intelligence in Sports - score 35
Sources: huggingface
Sports have long attracted broad attention as they push the limits of human physical and cognitive capabilities. Amid growing interest in spatial intelligence for vision-language models (VLMs), sports provide a natural testbed for understanding high-intensity human motion and dynamic object interact
🟢 Reading, Not Thinking: Understanding and Bridging the Modality Gap When Text Becomes Pixels in Multimodal LLMs - score 25
Sources: huggingface
Multimodal large language models (MLLMs) can process text presented as images, yet they often perform worse than when the same content is provided as textual tokens. We systematically diagnose this "modality gap" by evaluating seven MLLMs across seven benchmarks in five input modes, spanning both sy
🟢 Are Audio-Language Models Listening? Audio-Specialist Heads for Adaptive Audio Steering - score 5
Sources: huggingface
Multimodal large language models can exhibit text dominance, over-relying on linguistic priors instead of grounding predictions in non-text inputs. One example is large audio-language models (LALMs) where decisive audio evidence can be under-utilized even when it contains important information. To a
Developer Tools
🟢 MiniAppBench: Evaluating the Shift from Text to Interactive HTML Responses in LLM-Powered Assistants - score 15
Sources: huggingface
With the rapid advancement of Large Language Models (LLMs) in code generation, human-AI interaction is evolving from static text responses to dynamic, interactive HTML-based applications, which we term MiniApps. These applications require models to not only render visual interfaces but also construc
New Papers
| Title | Category | Score | Link |
|---|---|---|---|
| Geometry-Guided Reinforcement Learning for Multi-view Consistent 3D Scene Editing | model_release | 155 | Open |
| Thinking to Recall: How Reasoning Unlocks Parametric Knowledge in LLMs | developer_tool | 79 | Open |
| MM-Zero: Self-Evolving Multi-Model Vision Language Models From Zero Data | developer_tool | 56 | Open |
| Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion | developer_tool | 55 | Open |
| InternVL-U: Democratizing Unified Multimodal Models for Understanding, Reasoning, Generation and Editing | developer_tool | 53 | Open |
| Conversational AI-Enhanced Exploration System to Query Large-Scale Digitised Collections of Natural History Museums | cs.AI | 0 | Open |
| Quantum entanglement provides a competitive advantage in adversarial games | cs.AI | 0 | Open |
| Hybrid Self-evolving Structured Memory for GUI Agents | cs.AI | 0 | Open |
| Simulation-in-the-Reasoning (SiR): A Conceptual Framework for Empirically Grounded AI in Autonomous Transportation | cs.AI | 0 | Open |
| Is this Idea Novel? An Automated Benchmark for Judgment of Research Ideas | cs.AI | 0 | Open |
| The Chronicles of RiDiC: Generating Datasets with Controlled Popularity Distribution for Long-form Factuality Evaluation | cs.AI | 0 | Open |
| NasoVoce: A Nose-Mounted Low-Audibility Speech Interface for Always-Available Speech Interaction | cs.AI | 0 | Open |
| PC-Diffuser: Path-Consistent Capsule CBF Safety Filtering for Diffusion-Based Trajectory Planner | cs.AI | 0 | Open |
| Does Reasoning Make Search More Fair? Comparing Fairness in Reasoning and Non-Reasoning Rerankers | cs.AI | 0 | Open |
| Querying Everything Everywhere All at Once: Supervaluationism for the Agentic Lakehouse | cs.AI | 0 | Open |