🔴 High Significance

Developer Tools

🔴 Recursive Multi-Agent Systems - score 95 Sources: huggingface

Recursive or looped language models have recently emerged as a new scaling axis by iteratively refining the same model computation over latent states to deepen reasoning. We extend this scaling principle from a single model to multi-agent systems, and ask: Can agent collaboration itself be scaled th

🔴 DV-World: Benchmarking Data Visualization Agents in Real-World Scenarios - score 75 Sources: huggingface

Real-world data visualization (DV) requires native environmental grounding, cross-platform evolution, and proactive intent alignment. Yet, existing benchmarks often suffer from code-sandbox confinement, single-language creation-only tasks, and assumption of perfect intent. To bridge these gaps, we i

Infrastructure & Compute

🔴 Programming with Data: Test-Driven Data Engineering for Self-Improving LLMs from Raw Corpora - score 85 Sources: huggingface

Reliably transferring specialized human knowledge from text into large language models remains a fundamental challenge in artificial intelligence. Fine-tuning on domain corpora has enabled substantial capability gains, but the process operates without feedback: when a model fails on a domain task, t

🟡 Notable

Model Releases

🟡 AutoResearchBench: Benchmarking AI Agents on Complex Scientific Literature Discovery - score 65 Sources: huggingface

Autonomous scientific research is significantly advanced thanks to the development of AI agents. One key step in this process is finding the right scientific literature, whether to explore existing knowledge for a research problem, or to acquire evidence for verifying assumptions and supporting clai

🟡 Meta-CoT: Enhancing Granularity and Generalization in Image Editing - score 55 Sources: huggingface

Unified multi-modal understanding/generative models have shown improved image editing performance by incorporating fine-grained understanding into their Chain-of-Thought (CoT) process. However, a critical question remains underexplored: what forms of CoT and training strategy can jointly enhance bot

🟡 Where the goblins came from - score 50 Sources: lab_blog/OpenAI

How goblin outputs spread in AI models: timeline, root cause, and fixes behind personality-driven quirks in GPT-5 behavior.

Developer Tools

🟡 Refinement via Regeneration: Enlarging Modification Space Boosts Image Refinement in Unified Multimodal Models - score 45 Sources: huggingface

Unified multimodal models (UMMs) integrate visual understanding and generation within a single framework. For text-to-image (T2I) tasks, this unified capability allows UMMs to refine outputs after their initial generation, potentially extending the performance upper bound. Current UMM-based refineme

Infrastructure & Compute

🟡 Building the compute infrastructure for the Intelligence Age - score 50 Sources: lab_blog/OpenAI

OpenAI scales Stargate to build the compute infrastructure powering AGI, adding new data center capacity to meet growing AI demand.

Other Signals

🟡 Cybersecurity in the Intelligence Age - score 50 Sources: lab_blog/OpenAI

OpenAI outlines a five-part action plan for strengthening cybersecurity in the Intelligence Age, focused on democratizing AI-powered cyber defense and protecting critical systems.

🟢 Incremental

Developer Tools

🟢 Mutual Forcing: Dual-Mode Self-Evolution for Fast Autoregressive Audio-Video Character Generation - score 35 Sources: huggingface

In this work, we propose Mutual Forcing, a framework for fast autoregressive audio-video generation with long-horizon audio-video synchronization. Our approach addresses two key challenges: joint audio-video modeling and fast autoregressive generation. To ease joint audio-video optimization, we adop

🟢 Co-Director: Agentic Generative Video Storytelling - score 25 Sources: huggingface

While diffusion models generate high-fidelity video clips, transforming them into coherent storytelling engines remains challenging. Current agentic pipelines automate this via chained modules but suffer from semantic drift and cascading failures due to independent, handcrafted prompting. We present

🟢 Step-Audio-R1.5 Technical Report - score 15 Sources: huggingface

Recent advancements in large audio language models have extended Chain-of-Thought (CoT) reasoning into the auditory domain, enabling models to tackle increasingly complex acoustic and spoken tasks. To elicit and sustain these extended reasoning chains, the prevailing paradigm -- driven by the succes

🟢 Toward Scalable Terminal Task Synthesis via Skill Graphs - score 5 Sources: huggingface

Terminal agents have demonstrated strong potential for autonomous command-line execution, yet their training remains constrained by the scarcity of high-quality and diverse execution trajectories. Existing approaches mitigate this bottleneck by synthesizing large-scale terminal task instances for tr

📄 New Papers

Title | Category | Score | Link
Recursive Multi-Agent Systems | developer_tool | 239 | Open
Programming with Data: Test-Driven Data Engineering for Self-Improving LLMs from Raw Corpora | infrastructure | 86 | Open
DV-World: Benchmarking Data Visualization Agents in Real-World Scenarios | developer_tool | 45 | Open
AutoResearchBench: Benchmarking AI Agents on Complex Scientific Literature Discovery | model_release | 29 | Open
Meta-CoT: Enhancing Granularity and Generalization in Image Editing | model_release | 28 | Open
Option-Order Randomisation Reveals a Distributional Position Attractor in Prompted Sandbagging | cs.AI | 0 | Open
Agent Name Service (ANS): A Proof-of-Concept Trust Layer for Secure AI Agent Discovery, Identity, and Governance in Kubernetes | cs.AI | 0 | Open
Breaking the Autoregressive Chain: Hyper-Parallel Decoding for Efficient LLM-Based Attribute Value Extraction | cs.AI | 0 | Open
OMEGA: Optimizing Machine Learning by Evaluating Generated Algorithms | cs.AI | 0 | Open
Qvine: Vine Structured Quantum Circuits for Loading High Dimensional Distributions | cs.AI | 0 | Open
Seeking Consensus: Geometric-Semantic On-the-Fly Recalibration for Open-Vocabulary Remote Sensing Semantic Segmentation | cs.AI | 0 | Open
DepthPilot: From Controllability to Interpretability in Colonoscopy Video Generation | cs.AI | 0 | Open
Persuadability and LLMs as Legal Decision Tools | cs.AI | 0 | Open
LATTICE: Evaluating Decision Support Utility of Crypto Agents | cs.AI | 0 | Open
Apriori-based Analysis of Learned Helplessness in Mathematics Tutoring: Behavioral Patterns by Level, Intervention, and Outcome | cs.AI | 0 | Open

๐Ÿข Lab Blog Posts