🔴 High Significance

Model Releases

🔴 MOOSE-Star: Unlocking Tractable Training for Scientific Discovery by Breaking the Complexity Barrier - score 85. Sources: huggingface

While large language models (LLMs) show promise in scientific discovery, existing research focuses on inference or feedback-driven training, leaving the direct modeling of the generative reasoning process, P(hypothesis|background) (P(h|b)), unexplored. We demonstrate that directly training P(h|b) is

Developer Tools

🔴 SkillNet: Create, Evaluate, and Connect AI Skills - score 95. Sources: huggingface

Current AI agents can flexibly invoke tools and execute complex tasks, yet their long-term advancement is hindered by the lack of systematic accumulation and transfer of skills. Without a unified mechanism for skill consolidation, agents frequently "reinvent the wheel", rediscovering solutions in

🔴 DARE: Aligning LLM Agents with the R Statistical Ecosystem via Distribution-Aware Retrieval - score 75. Sources: huggingface

Large Language Model (LLM) agents can automate data-science workflows, but many rigorous statistical methods implemented in R remain underused because LLMs struggle with statistical knowledge and tool retrieval. Existing retrieval-augmented approaches focus on function-level semantics and ignore dat

🟡 Notable

Model Releases

🟡 AgentVista: Evaluating Multimodal Agents in Ultra-Challenging Realistic Visual Scenarios - score 65. Sources: huggingface

Real-world multimodal agents solve multi-step workflows grounded in visual evidence. For example, an agent can troubleshoot a device by linking a wiring photo to a schematic and validating the fix with online documentation, or plan a trip by interpreting a transit map and checking schedules under ro

Developer Tools

🟡 RoboPocket: Improve Robot Policies Instantly with Your Phone - score 55. Sources: huggingface

Scaling imitation learning is fundamentally constrained by the efficiency of data collection. While handheld interfaces have emerged as a scalable solution for in-the-wild data acquisition, they predominantly operate in an open-loop manner: operators blindly collect demonstrations without knowing th

🟡 Codex Security: now in research preview - score 50. Sources: lab_blog/OpenAI

Codex Security is an AI application security agent that analyzes project context to detect, validate, and patch complex vulnerabilities with higher confidence and less noise.

🟡 How Descript engineers multilingual video dubbing at scale - score 50. Sources: lab_blog/OpenAI

Using OpenAI reasoning models, Descript unlocked automatic localization of large content libraries without losing timing or meaning.

🟡 How Balyasny Asset Management built an AI research engine - score 50. Sources: lab_blog/OpenAI

By combining rigorous model evaluation, full-platform use of OpenAI, and agent workflows, Balyasny is reinventing investment research.

🟡 MASQuant: Modality-Aware Smoothing Quantization for Multimodal Large Language Models - score 45. Sources: huggingface

Post-training quantization (PTQ) with computational invariance for Large Language Models (LLMs) has demonstrated remarkable advances; however, its application to Multimodal Large Language Models (MLLMs) presents substantial challenges. In this paper, we analyze SmoothQuant as a case study and ide

🟢 Incremental

Model Releases

🟢 Timer-S1: A Billion-Scale Time Series Foundation Model with Serial Scaling - score 5. Sources: huggingface

We introduce Timer-S1, a strong Mixture-of-Experts (MoE) time series foundation model with 8.3B total parameters, 0.75B activated parameters for each token, and a context length of 11.5K. To overcome the scalability bottleneck in existing pre-trained time series foundation models, we perform Serial

Developer Tools

🟢 HiFi-Inpaint: Towards High-Fidelity Reference-Based Inpainting for Generating Detail-Preserving Human-Product Images - score 35. Sources: huggingface

Human-product images, which showcase the integration of humans and products, play a vital role in advertising, e-commerce, and digital marketing. The essential challenge of generating such images lies in ensuring the high-fidelity preservation of product details. Among existing paradigms, reference-

🟢 Interactive Benchmarks - score 25. Sources: huggingface

Standard benchmarks have become increasingly unreliable due to saturation, subjectivity, and poor generalization. We argue that evaluating a model's ability to actively acquire information is important for assessing its intelligence. We propose Interactive Benchmarks, a unified evaluation paradigm tha

🟢 Large Multimodal Models as General In-Context Classifiers - score 15. Sources: huggingface

Which multimodal model should we use for classification? Previous studies suggest that the answer lies in CLIP-like contrastive Vision-Language Models (VLMs), due to their remarkable performance in zero-shot classification. In contrast, Large Multimodal Models (LMMs) are more suitable for complex tas

📄 New Papers

| Title | Category | Score | Link |
| --- | --- | --- | --- |
| SkillNet: Create, Evaluate, and Connect AI Skills | developer_tool | 99 | Open |
| MOOSE-Star: Unlocking Tractable Training for Scientific Discovery by Breaking the Complexity Barrier | model_release | 95 | Open |
| DARE: Aligning LLM Agents with the R Statistical Ecosystem via Distribution-Aware Retrieval | developer_tool | 56 | Open |
| AgentVista: Evaluating Multimodal Agents in Ultra-Challenging Realistic Visual Scenarios | model_release | 47 | Open |
| RoboPocket: Improve Robot Policies Instantly with Your Phone | developer_tool | 39 | Open |
| Bridging Domains through Subspace-Aware Model Merging | cs.AI | 0 | Open |
| Depth Charge: Jailbreak Large Language Models from Deep Safety Attention Heads | cs.AI | 0 | Open |
| Knowing without Acting: The Disentangled Geometry of Safety Mechanisms in Large Language Models | cs.AI | 0 | Open |
| PVminerLLM: Structured Extraction of Patient Voice from Patient-Generated Text using Large Language Models | cs.AI | 0 | Open |
| Balancing Domestic and Global Perspectives: Evaluating Dual-Calibration and LLM-Generated Nudges for Diverse News Recommendation | cs.AI | 0 | Open |
| Visual Words Meet BM25: Sparse Auto-Encoder Visual Word Scoring for Image Retrieval | cs.AI | 0 | Open |
| Proof-of-Guardrail in AI Agents and What (Not) to Trust from It | cs.AI | 0 | Open |
| ProtAlign: Contrastive learning paradigm for Sequence and structure alignment | cs.AI | 0 | Open |
| AWPD: Frequency Shield Network for Agnostic Watermark Presence Detection | cs.AI | 0 | Open |
| Bi Directional Feedback Fusion for Activity Aware Forecasting of Indoor CO2 and PM2.5 | cs.AI | 0 | Open |

๐Ÿข Lab Blog Posts