🔴 High Significance
Model Releases
🔴 Less is Enough: Synthesizing Diverse Data in Feature Space of LLMs - score 95
Sources: huggingface
The diversity of post-training data is critical for effective downstream performance in large language models (LLMs). Many existing approaches to constructing post-training data quantify diversity using text-based metrics that capture linguistic variation, but such metrics provide only weak signals
🔴 MedXIAOHE: A Comprehensive Recipe for Building Medical MLLMs - score 75
Sources: huggingface
We present MedXIAOHE, a medical vision-language foundation model designed to advance general-purpose medical understanding and reasoning in real-world clinical applications. MedXIAOHE achieves state-of-the-art performance across diverse medical benchmarks and surpasses leading closed-source multimod
Developer Tools
🔴 SQuTR: A Robustness Benchmark for Spoken Query to Text Retrieval under Acoustic Noise - score 85
Sources: huggingface
Spoken query retrieval is an important interaction mode in modern information retrieval. However, existing evaluation datasets are often limited to simple queries under constrained noise conditions, making them inadequate for assessing the robustness of spoken query retrieval systems under complex a
🟡 Notable
Model Releases
🟡 Towards Universal Video MLLMs with Attribute-Structured and Quality-Verified Instructions - score 50
Sources: huggingface
Universal video understanding requires modeling fine-grained visual and audio information over time in diverse real-world scenarios. However, the performance of existing models is primarily constrained by video-instruction data that represents complex audiovisual content as single, incomplete descri
🟡 OneVision-Encoder: Codec-Aligned Sparsity as a Foundational Principle for Multimodal Intelligence - score 50
Sources: huggingface
Hypothesis. Artificial general intelligence is, at its core, a compression problem. Effective compression demands resonance: deep learning scales best when its architecture aligns with the fundamental structure of the data. These are the fundamental principles. Yet, modern vision architectures have
Developer Tools
🟡 Zooming without Zooming: Region-to-Image Distillation for Fine-Grained Multimodal Perception - score 65
Sources: huggingface
Multimodal Large Language Models (MLLMs) excel at broad visual understanding but still struggle with fine-grained perception, where decisive evidence is small and easily overwhelmed by global context. Recent "Thinking-with-Images" methods alleviate this by iteratively zooming in and out of regions of i
🟢 Incremental
Model Releases
🟢 GeoAgent: Learning to Geolocate Everywhere with Reinforced Geographic Characteristics - score 15
Sources: huggingface
This paper presents GeoAgent, a model capable of reasoning closely with humans and deriving fine-grained address conclusions. Previous RL-based methods have achieved breakthroughs in performance and interpretability, but concerns remain because of their reliance on AI-generated chain-of-thought
Developer Tools
🟢 CoPE-VideoLM: Codec Primitives For Efficient Video Language Models - score 35
Sources: huggingface
Video Language Models (VideoLMs) empower AI systems to understand temporal dynamics in videos. To fit within the maximum context window, current methods use keyframe sampling, which can miss both macro-level events and micro-level details because of its sparse temporal coverage. Furthermore, proce
🟢 SemanticMoments: Training-Free Motion Similarity via Third Moment Features - score 25
Sources: huggingface
Retrieving videos based on semantic motion is a fundamental, yet unsolved, problem. Existing video representation approaches overly rely on static appearance and scene context rather than motion dynamics, a bias inherited from their training data and objectives. Conversely, traditional motion-centri
🟢 What does RL improve for Visual Reasoning? A Frankenstein-Style Analysis - score 5
Sources: huggingface
Reinforcement learning (RL) with verifiable rewards has become a standard post-training stage for boosting visual reasoning in vision-language models, yet it remains unclear what capabilities RL actually improves compared with supervised fine-tuning as cold-start initialization (IN). End-to-end benc
New Papers
| Title | Category | Score | Link |
|---|---|---|---|
| Less is Enough: Synthesizing Diverse Data in Feature Space of LLMs | model_release | 249 | Open |
| SQuTR: A Robustness Benchmark for Spoken Query to Text Retrieval under Acoustic Noise | developer_tool | 220 | Open |
| MedXIAOHE: A Comprehensive Recipe for Building Medical MLLMs | model_release | 81 | Open |
| Zooming without Zooming: Region-to-Image Distillation for Fine-Grained Multimodal Perception | developer_tool | 66 | Open |
| Towards Universal Video MLLMs with Attribute-Structured and Quality-Verified Instructions | model_release | 56 | Open |
| OneVision-Encoder: Codec-Aligned Sparsity as a Foundational Principle for Multimodal Intelligence | model_release | 56 | Open |
| Key Considerations for Domain Expert Involvement in LLM Design and Evaluation: An Ethnographic Study | cs.AI | 0 | Open |
| High Precision Audience Expansion via Extreme Classification in a Two-Sided Marketplace | cs.AI | 0 | Open |
| A Trajectory-Based Safety Audit of Clawdbot (OpenClaw) | cs.AI | 0 | Open |
| Image-based Joint-level Detection for Inflammation in Rheumatoid Arthritis from Small and Imbalanced Data | cs.AI | 0 | Open |
| InnoEval: On Research Idea Evaluation as a Knowledge-Grounded, Multi-Perspective Reasoning Problem | cs.AI | 0 | Open |
| Competition for attention predicts good-to-bad tipping in AI | cs.AI | 0 | Open |
| Differentially Private Retrieval-Augmented Generation | cs.AI | 0 | Open |
| Adapting VACE for Real-Time Autoregressive Video Diffusion | cs.AI | 0 | Open |
| Hello-Chat: Towards Realistic Social Audio Interactions | cs.AI | 0 | Open |