🔴 High Significance

Model Releases

🔴 Step 3.5 Flash: Open Frontier-Level Intelligence with 11B Active Parameters | score 95 | Sources: huggingface

We introduce Step 3.5 Flash, a sparse Mixture-of-Experts (MoE) model that bridges frontier-level agentic intelligence and computational efficiency. We focus on what matters most when building agents: sharp reasoning and fast, reliable execution. Step 3.5 Flash pairs a 196B-parameter foundation with…
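
A sparse MoE leaves most of its weights idle for any given token. Assuming the 196B figure is the total parameter count, 11B active parameters means roughly 5.6% (11/196) of the weights run per token. The sketch below shows generic top-k expert routing in plain Python; the router rows, expert functions, and `k` are illustrative assumptions, not Step 3.5 Flash's actual architecture.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a short list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def moe_layer(
    x: Vector,
    gate: Sequence[Vector],  # hypothetical router: one weight row per expert
    experts: Sequence[Callable[[Vector], Vector]],
    k: int = 2,
) -> Vector:
    """Generic top-k MoE layer: only k of len(experts) experts run for this token.

    Assumes every expert returns a vector of the same length as x.
    """
    # Router logits: dot product of the token vector with each expert's gate row.
    logits = [sum(g * xi for g, xi in zip(row, x)) for row in gate]
    # Select the k experts with the highest logits.
    top = sorted(range(len(experts)), key=lambda i: logits[i], reverse=True)[:k]
    # Renormalize routing weights over the selected experts only.
    weights = softmax([logits[i] for i in top])
    out = [0.0] * len(x)
    for w, i in zip(weights, top):
        y = experts[i](x)  # unselected experts are never evaluated
        out = [o + w * yj for o, yj in zip(out, y)]
    return out
```

With `k=2` and eight experts, each token pays for two expert evaluations while the layer still stores eight experts' worth of parameters; that gap is what an "active versus total" parameter split describes.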

🔴 GENIUS: Generative Fluid Intelligence Evaluation Suite | score 75 | Sources: huggingface

Unified Multimodal Models (UMMs) have shown remarkable progress in visual generation. Yet, existing benchmarks predominantly assess Crystallized Intelligence, which relies on recalling accumulated knowledge and learned schemas. This focus overlooks Generative Fluid Intelligence (GFI): the capacity…

Developer Tools

🔴 VidVec: Unlocking Video MLLM Embeddings for Video-Text Retrieval | score 85 | Sources: huggingface

Recent studies have adapted generative Multimodal Large Language Models (MLLMs) into embedding extractors for vision tasks, typically through fine-tuning to produce universal representations. However, their performance on video remains inferior to Video Foundation Models (VFMs). In this paper, we…
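
Embedding-based video-text retrieval is usually scored by ranking every candidate video by cosine similarity to a text query and reporting Recall@K. A minimal sketch of that evaluation, assuming text query `i` is paired with video `i`; the toy 2-D vectors stand in for real MLLM embeddings, and nothing here reproduces VidVec's own embedding method.

```python
import math
from typing import List, Sequence

Vector = List[float]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def recall_at_k(text_embs: Sequence[Vector], video_embs: Sequence[Vector], k: int) -> float:
    """Text-to-video Recall@K, assuming query i's ground-truth video is video_embs[i]."""
    hits = 0
    for qi, query in enumerate(text_embs):
        scores = [cosine(query, v) for v in video_embs]
        # Rank all candidate videos from most to least similar.
        ranked = sorted(range(len(video_embs)), key=lambda j: scores[j], reverse=True)
        if qi in ranked[:k]:
            hits += 1
    return hits / len(text_embs)
```

For example, with texts `[[1, 0], [0, 1], [1, 0.2]]` and videos `[[1, 0], [0, 1], [0.5, 0.5]]`, the third query ranks its own video second, so Recall@1 is 2/3 and Recall@2 is 1.0.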

🟡 Notable

Model Releases

🟡 ASA: Training-Free Representation Engineering for Tool-Calling Agents | score 55 | Sources: huggingface

Adapting LLM agents to domain-specific tool calling remains notably brittle under evolving interfaces. Prompt and schema engineering is easy to deploy but often fragile under distribution shift and strict parsers, while continual parameter-efficient fine-tuning improves reliability at the cost of…
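
Training-free representation engineering typically edits a model's hidden activations at inference time instead of its weights. One common variant, shown here as an assumption because the excerpt does not describe ASA's actual procedure, builds a steering direction as the difference between mean hidden states on desired and undesired examples (for instance, well-formed versus malformed tool calls) and adds a scaled copy of it to new hidden states. The toy 2-D vectors stand in for real layer activations; which layer to read and which prompts to contrast are left open.

```python
from typing import List, Sequence

Vector = List[float]


def mean(vectors: Sequence[Vector]) -> Vector:
    """Element-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def steering_vector(positive: Sequence[Vector], negative: Sequence[Vector]) -> Vector:
    """Difference of means: points from the 'negative' toward the 'positive' behaviour."""
    mp, mn = mean(positive), mean(negative)
    return [p - q for p, q in zip(mp, mn)]


def steer(hidden: Vector, direction: Vector, alpha: float) -> Vector:
    """Add the scaled direction to one hidden state; no model weights change."""
    return [h + alpha * d for h, d in zip(hidden, direction)]
```

The scale `alpha` is the main knob in this kind of scheme: too small and behaviour does not move, too large and the hidden state drifts off the distribution the model was trained on.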

🟡 Introducing GPT-5.3-Codex-Spark | score 50 | Sources: lab_blog/OpenAI

Introducing GPT-5.3-Codex-Spark, our first real-time coding model: 15x faster generation, 128k context, now in research preview for ChatGPT Pro users.

🟡 Gemini 3 Deep Think: Advancing science, research and engineering | score 50 | Sources: lab_blog/DeepMind

Our most specialized reasoning mode is now updated to solve modern science, research and engineering challenges.

🟡 Towards Autonomous Mathematics Research | score 45 | Sources: huggingface

Recent advances in foundational models have yielded reasoning systems capable of achieving a gold-medal standard at the International Mathematical Olympiad. The transition from competition-level problem-solving to professional research, however, requires navigating vast literature and constructing…

Developer Tools

🟡 PhyCritic: Multimodal Critic Models for Physical AI | score 65 | Sources: huggingface

With the rapid development of large multimodal models, reliable judge and critic models have become essential for open-ended evaluation and preference alignment, providing pairwise preferences, numerical scores, and explanatory justifications for assessing model-generated responses. However, existing…

🟢 Incremental

Model Releases

🟢 TimeChat-Captioner: Scripting Multi-Scene Videos with Time-Aware and Structural Audio-Visual Captions | score 10 | Sources: huggingface

This paper proposes Omni Dense Captioning, a novel task designed to generate continuous, fine-grained, and structured audio-visual narratives with explicit timestamps. To ensure dense semantic coverage, we introduce a six-dimensional structural schema to create "script-like" captions, enabling…

Developer Tools

🟢 When to Memorize and When to Stop: Gated Recurrent Memory for Long-Context Reasoning | score 35 | Sources: huggingface

While reasoning over long context is crucial for various real-world applications, it remains challenging for large language models (LLMs) as they suffer from performance degradation as the context length grows. Recent work MemAgent has tried to tackle this by processing context chunk-by-chunk in an…
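
MemAgent-style reading walks through the context chunk by chunk while carrying a bounded memory. The title suggests two gates on that loop: one deciding whether a chunk should update memory at all, and one deciding when enough has been read to stop. A minimal sketch under that reading; `should_update`, `update`, and `should_stop` are hypothetical stand-ins for what would be LLM calls in practice, and the paper's exact gating mechanism is not described in the excerpt.

```python
from typing import Callable, Sequence, Tuple


def gated_memory_read(
    chunks: Sequence[str],
    should_update: Callable[[str, str], bool],  # hypothetical "memorize" gate
    update: Callable[[str, str], str],          # hypothetical memory-rewrite step
    should_stop: Callable[[str], bool],         # hypothetical "stop" gate
) -> Tuple[str, int]:
    """Read chunks in order with one memory string; return (memory, chunks_read)."""
    memory = ""
    chunks_read = 0
    for chunk in chunks:
        chunks_read += 1
        # Memorize gate: skip irrelevant chunks instead of rewriting memory every step.
        if should_update(memory, chunk):
            memory = update(memory, chunk)
        # Stop gate: exit early once memory is judged sufficient.
        if should_stop(memory):
            break
    return memory, chunks_read
```

With toy keyword gates, a four-chunk input whose answer sits in chunk two yields that chunk as memory after reading only two chunks; the remaining chunks are never processed.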

🟢 How Do Decoder-Only LLMs Perceive Users? Rethinking Attention Masking for User Representation Learning | score 25 | Sources: huggingface

Decoder-only large language models are increasingly used as behavioral encoders for user representation learning, yet the impact of attention masking on the quality of user embeddings remains underexplored. In this work, we conduct a systematic study of causal, hybrid, and bidirectional attention…
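
The three masking schemes differ only in which earlier or later positions each token may attend to. A minimal sketch that builds each mask as a boolean matrix (`mask[i][j]` is True when token `i` may attend to token `j`). The excerpt does not define "hybrid", so the version here assumes a prefix-LM layout: full attention within the first `prefix_len` tokens and causal attention after them.

```python
from typing import List

Mask = List[List[bool]]  # mask[i][j] is True when token i may attend to token j


def causal_mask(n: int) -> Mask:
    """Decoder-only default: token i attends to tokens 0..i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n: int) -> Mask:
    """Every token attends to every token."""
    return [[True] * n for _ in range(n)]


def hybrid_mask(n: int, prefix_len: int) -> Mask:
    """Assumed prefix-LM layout: full attention inside the prefix, causal afterwards."""
    return [[(i < prefix_len and j < prefix_len) or j <= i for j in range(n)] for i in range(n)]


def show(mask: Mask) -> str:
    """Render a mask as rows of 1s (allowed) and 0s (blocked)."""
    return "\n".join("".join("1" if allowed else "0" for allowed in row) for row in mask)
```

`print(show(hybrid_mask(4, 2)))` prints `1100`, `1100`, `1110`, `1111` on separate lines: the first two tokens see each other, while later tokens remain causal.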

🟢 G-LNS: Generative Large Neighborhood Search for LLM-Based Automatic Heuristic Design | score 10 | Sources: huggingface

While Large Language Models (LLMs) have recently shown promise in Automated Heuristic Design (AHD), existing approaches typically formulate AHD around constructive priority rules or parameterized local search guidance, thereby restricting the search space to fixed heuristic forms. Such designs offer…
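
Large Neighborhood Search (LNS) alternates a destroy step, which removes part of a solution, with a repair step, which rebuilds it, and keeps the result if it improves. The sketch below shows only that classic loop on a toy traveling-salesman instance. The hand-written random-removal and cheapest-insertion operators stand in for the operators that G-LNS presumably has an LLM generate; that mapping is an assumption, not the paper's method.

```python
import math
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def tour_length(tour: Sequence[int], pts: Sequence[Point]) -> float:
    """Length of a closed tour visiting pts in the given order."""
    n = len(tour)
    return sum(math.dist(pts[tour[i]], pts[tour[(i + 1) % n]]) for i in range(n))


def destroy(tour: List[int], k: int, rng: random.Random) -> Tuple[List[int], List[int]]:
    """Destroy operator: remove k random cities from the tour."""
    removed = rng.sample(tour, k)
    kept = [c for c in tour if c not in removed]
    return kept, removed


def repair(partial: List[int], removed: List[int], pts: Sequence[Point]) -> List[int]:
    """Repair operator: reinsert each removed city at its cheapest position."""
    tour = list(partial)
    for c in removed:
        best_pos, best_cost = len(tour), math.inf
        for i in range(len(tour)):
            a, b = tour[i], tour[(i + 1) % len(tour)]
            # Extra length from placing c between a and b.
            cost = math.dist(pts[a], pts[c]) + math.dist(pts[c], pts[b]) - math.dist(pts[a], pts[b])
            if cost < best_cost:
                best_pos, best_cost = i + 1, cost
        tour.insert(best_pos, c)
    return tour


def lns(pts: Sequence[Point], iters: int = 200, k: int = 2, seed: int = 0) -> Tuple[List[int], float]:
    """Plain LNS loop: destroy, repair, keep the candidate only if it is shorter."""
    rng = random.Random(seed)
    best = list(range(len(pts)))
    best_len = tour_length(best, pts)
    for _ in range(iters):
        partial, removed = destroy(best, k, rng)
        candidate = repair(partial, removed, pts)
        cand_len = tour_length(candidate, pts)
        if cand_len < best_len:
            best, best_len = candidate, cand_len
    return best, best_len
```

On the four corners of a unit square visited in a crossing order (length 2 + 2√2), this loop recovers the perimeter tour of length 4. Accepting only strict improvements is the simplest acceptance rule; practical LNS variants often accept some worse candidates to escape local optima.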

📄 New Papers

Title | Category | Score | Link
Step 3.5 Flash: Open Frontier-Level Intelligence with 11B Active Parameters | model_release | 202 | Open
VidVec: Unlocking Video MLLM Embeddings for Video-Text Retrieval | developer_tool | 126 | Open
GENIUS: Generative Fluid Intelligence Evaluation Suite | model_release | 58 | Open
PhyCritic: Multimodal Critic Models for Physical AI | developer_tool | 57 | Open
ASA: Training-Free Representation Engineering for Tool-Calling Agents | model_release | 44 | Open
From Noise to Order: Learning to Rank via Denoising Diffusion | cs.AI | 0 | Open
Credit Where It is Due: Cross-Modality Connectivity Drives Precise Reinforcement Learning for MLLM Reasoning | cs.AI | 0 | Open
EM-Aware Physical Synthesis: Neural Inductor Modeling and Intelligent Placement & Routing for RF Circuits | cs.AI | 0 | Open
Compiler-Guided Inference-Time Adaptation: Improving GPT-5 Programming Performance in Idris | cs.AI | 0 | Open
Understanding Persuasive Interactions between Generative Social Agents and Humans: The Knowledge-based Persuasion Model (KPM) | cs.AI | 0 | Open
IMAGAgent: Orchestrating Multi-Turn Image Editing via Constraint-Aware Planning and Reflection | cs.AI | 0 | Open
RooflineBench: A Benchmarking Framework for On-Device LLMs via Roofline Analysis | cs.AI | 0 | Open
Multimodal Fact-Level Attribution for Verifiable Reasoning | cs.AI | 0 | Open
AgentLeak: A Full-Stack Benchmark for Privacy Leakage in Multi-Agent LLM Systems | cs.AI | 0 | Open
Differentially Private and Communication Efficient Large Language Model Split Inference via Stochastic Quantization and Soft Prompt | cs.AI | 0 | Open

🏒 Lab Blog Posts