🔴 High Significance

Model Releases

🔴 Less is Enough: Synthesizing Diverse Data in Feature Space of LLMs - score 95. Sources: huggingface

The diversity of post-training data is critical for effective downstream performance in large language models (LLMs). Many existing approaches to constructing post-training data quantify diversity using text-based metrics that capture linguistic variation, but such metrics provide only weak signals

🔴 MedXIAOHE: A Comprehensive Recipe for Building Medical MLLMs - score 75. Sources: huggingface

We present MedXIAOHE, a medical vision-language foundation model designed to advance general-purpose medical understanding and reasoning in real-world clinical applications. MedXIAOHE achieves state-of-the-art performance across diverse medical benchmarks and surpasses leading closed-source multimod

Developer Tools

🔴 SQuTR: A Robustness Benchmark for Spoken Query to Text Retrieval under Acoustic Noise - score 85. Sources: huggingface

Spoken query retrieval is an important interaction mode in modern information retrieval. However, existing evaluation datasets are often limited to simple queries under constrained noise conditions, making them inadequate for assessing the robustness of spoken query retrieval systems under complex a

🟡 Notable

Model Releases

🟡 Towards Universal Video MLLMs with Attribute-Structured and Quality-Verified Instructions - score 50. Sources: huggingface

Universal video understanding requires modeling fine-grained visual and audio information over time in diverse real-world scenarios. However, the performance of existing models is primarily constrained by video-instruction data that represents complex audiovisual content as single, incomplete descri

🟡 OneVision-Encoder: Codec-Aligned Sparsity as a Foundational Principle for Multimodal Intelligence - score 50. Sources: huggingface

Hypothesis. Artificial general intelligence is, at its core, a compression problem. Effective compression demands resonance: deep learning scales best when its architecture aligns with the fundamental structure of the data. These are the fundamental principles. Yet, modern vision architectures have

Developer Tools

🟡 Zooming without Zooming: Region-to-Image Distillation for Fine-Grained Multimodal Perception - score 65. Sources: huggingface

Multimodal Large Language Models (MLLMs) excel at broad visual understanding but still struggle with fine-grained perception, where the decisive evidence is small and easily overwhelmed by global context. Recent "Thinking-with-Images" methods alleviate this by iteratively zooming in and out of regions of i

🟢 Incremental

Model Releases

🟢 GeoAgent: Learning to Geolocate Everywhere with Reinforced Geographic Characteristics - score 15. Sources: huggingface

This paper presents GeoAgent, a model capable of reasoning closely with humans and deriving fine-grained address conclusions. Previous RL-based methods have achieved breakthroughs in performance and interpretability but still raise concerns because of their reliance on AI-generated chain-of-thought

Developer Tools

🟢 CoPE-VideoLM: Codec Primitives For Efficient Video Language Models - score 35. Sources: huggingface

Video Language Models (VideoLMs) empower AI systems to understand temporal dynamics in videos. To fit within the maximum context window, current methods use keyframe sampling, which can miss both macro-level events and micro-level details because of its sparse temporal coverage. Furthermore, proce

🟢 SemanticMoments: Training-Free Motion Similarity via Third Moment Features - score 25. Sources: huggingface

Retrieving videos based on semantic motion is a fundamental, yet unsolved, problem. Existing video representation approaches overly rely on static appearance and scene context rather than motion dynamics, a bias inherited from their training data and objectives. Conversely, traditional motion-centri

🟢 What does RL improve for Visual Reasoning? A Frankenstein-Style Analysis - score 5. Sources: huggingface

Reinforcement learning (RL) with verifiable rewards has become a standard post-training stage for boosting visual reasoning in vision-language models, yet it remains unclear what capabilities RL actually improves compared with supervised fine-tuning as cold-start initialization (IN). End-to-end benc

📄 New Papers

Title | Category | Score | Link
Less is Enough: Synthesizing Diverse Data in Feature Space of LLMs | model_release | 249 | Open
SQuTR: A Robustness Benchmark for Spoken Query to Text Retrieval under Acoustic Noise | developer_tool | 220 | Open
MedXIAOHE: A Comprehensive Recipe for Building Medical MLLMs | model_release | 81 | Open
Zooming without Zooming: Region-to-Image Distillation for Fine-Grained Multimodal Perception | developer_tool | 66 | Open
Towards Universal Video MLLMs with Attribute-Structured and Quality-Verified Instructions | model_release | 56 | Open
OneVision-Encoder: Codec-Aligned Sparsity as a Foundational Principle for Multimodal Intelligence | model_release | 56 | Open
Key Considerations for Domain Expert Involvement in LLM Design and Evaluation: An Ethnographic Study | cs.AI | 0 | Open
High Precision Audience Expansion via Extreme Classification in a Two-Sided Marketplace | cs.AI | 0 | Open
A Trajectory-Based Safety Audit of Clawdbot (OpenClaw) | cs.AI | 0 | Open
Image-based Joint-level Detection for Inflammation in Rheumatoid Arthritis from Small and Imbalanced Data | cs.AI | 0 | Open
InnoEval: On Research Idea Evaluation as a Knowledge-Grounded, Multi-Perspective Reasoning Problem | cs.AI | 0 | Open
Competition for attention predicts good-to-bad tipping in AI | cs.AI | 0 | Open
Differentially Private Retrieval-Augmented Generation | cs.AI | 0 | Open
Adapting VACE for Real-Time Autoregressive Video Diffusion | cs.AI | 0 | Open
Hello-Chat: Towards Realistic Social Audio Interactions | cs.AI | 0 | Open