🔴 High Significance

Model Releases

🔴 GrandCode: Achieving Grandmaster Level in Competitive Programming via Agentic Reinforcement Learning | score 95 | Sources: huggingface

Competitive programming remains one of the last few human strongholds in coding against AI. The best AI system to date still underperforms the best humans in competitive programming: the most recent best result, Google's Gemini 3 Deep Think, attained 8th place despite not being evaluated under live compet

Developer Tools

🔴 InCoder-32B-Thinking: Industrial Code World Model for Thinking | score 85 | Sources: huggingface

Industrial software development across chip design, GPU optimization, and embedded systems lacks expert reasoning traces showing how engineers reason about hardware constraints and timing semantics. In this work, we propose InCoder-32B-Thinking, trained on the data from the Error-driven Chain-of-Tho

🔴 Self-Distilled RLVR | score 75 | Sources: huggingface

On-policy distillation (OPD) has become a popular training paradigm in the LLM community. This paradigm selects a larger model as the teacher to provide dense, fine-grained signals for each sampled trajectory, in contrast to reinforcement learning with verifiable rewards (RLVR), which only obtains s

🟡 Notable

Model Releases

🟡 Agentic-MME: What Agentic Capability Really Brings to Multimodal Intelligence? | score 55 | Sources: huggingface

Multimodal Large Language Models (MLLMs) are evolving from passive observers into active agents, solving problems through Visual Expansion (invoking visual tools) and Knowledge Expansion (open-web search). However, existing evaluations fall short: they lack flexible tool integration, test visual and

Developer Tools

🟡 A Simple Baseline for Streaming Video Understanding | score 65 | Sources: huggingface

Recent streaming video understanding methods increasingly rely on complex memory mechanisms to handle long video streams. We challenge this trend with a simple finding: a sliding-window baseline that feeds only the most recent N frames to an off-the-shelf VLM already matches or surpasses published s
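The sliding-window idea in this summary can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `sliding_window_answers` and `ask_vlm` are hypothetical names, and `ask_vlm` stands in for whatever call sends a list of frames to an off-the-shelf VLM.

```python
from collections import deque
from typing import Callable, Iterable, Iterator, List, TypeVar

Frame = TypeVar("Frame")
Answer = TypeVar("Answer")


def sliding_window_answers(
    frames: Iterable[Frame],
    window_size: int,
    ask_vlm: Callable[[List[Frame]], Answer],
) -> Iterator[Answer]:
    """Query a model on only the most recent `window_size` frames.

    `ask_vlm` is a hypothetical stand-in for a real VLM call; no
    additional memory of older frames is kept.
    """
    if window_size < 1:
        raise ValueError("window_size must be at least 1")
    # deque(maxlen=N) silently drops the oldest frame once N are held.
    window: deque = deque(maxlen=window_size)
    for frame in frames:
        window.append(frame)
        yield ask_vlm(list(window))


# Usage with a dummy "model" that just echoes the frames it was shown.
outputs = list(sliding_window_answers(range(5), 3, lambda fs: fs))
```

With five frames and a window of three, the last query sees only frames 2, 3, and 4; everything older has been discarded.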

🟡 Industrial policy for the Intelligence Age | score 50 | Sources: lab_blog/OpenAI

Explore our ambitious, people-first industrial policy ideas for the AI era, focused on expanding opportunity, sharing prosperity, and building resilient institutions as advanced intelligence evolves.

Other Signals

🟡 Announcing the OpenAI Safety Fellowship | score 50 | Sources: lab_blog/OpenAI

A pilot program to support independent safety and alignment research and develop the next generation of talent.

🟡 Token Warping Helps MLLMs Look from Nearby Viewpoints | score 45 | Sources: huggingface

Can warping tokens, rather than pixels, help multimodal large language models (MLLMs) understand how a scene appears from a nearby viewpoint? While MLLMs perform well on visual reasoning, they remain fragile to viewpoint changes, as pixel-wise warping is highly sensitive to small depth errors and of

🟢 Incremental

Model Releases

🟢 Communicating about Space: Language-Mediated Spatial Integration Across Partial Views | score 25 | Sources: huggingface

Humans build shared spatial understanding by communicating partial, viewpoint-dependent observations. We ask whether Multimodal Large Language Models (MLLMs) can do the same, aligning distinct egocentric views through dialogue to form a coherent, allocentric mental model of a shared environment. To

🟢 AgentHazard: A Benchmark for Evaluating Harmful Behavior in Computer-Use Agents | score 15 | Sources: huggingface

Computer-use agents extend language models from text generation to persistent action over tools, files, and execution environments. Unlike chat systems, they maintain state across interactions and translate intermediate outputs into concrete actions. This creates a distinct safety challenge in that

Developer Tools

🟢 Test-Time Scaling Makes Overtraining Compute-Optimal | score 35 | Sources: huggingface

Modern LLMs scale at test-time, e.g. via repeated sampling, where inference cost grows with model size and the number of samples. This creates a trade-off that pretraining scaling laws, such as Chinchilla, do not address. We present Train-to-Test (T^2) scaling laws that jointly optimize model size,
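The trade-off described here can be made concrete with a rough sketch. It is not the paper's T^2 law: it assumes independent samples, so the chance that at least one of k samples succeeds is 1 - (1 - p)^k, and it uses the common approximation of about 2 FLOPs per parameter per generated token for a dense model. The function names and all numbers are hypothetical.

```python
def pass_at_k(p: float, k: int) -> float:
    """Chance that at least one of k independent samples succeeds,
    given per-sample success probability p (independence is assumed)."""
    return 1.0 - (1.0 - p) ** k


def inference_cost(params: int, tokens_per_sample: int, k: int) -> int:
    """Approximate FLOPs for k samples, assuming ~2 FLOPs per parameter
    per generated token (a rule of thumb, not a measured figure)."""
    return 2 * params * tokens_per_sample * k


# Hypothetical comparison at equal inference budget:
# a 1B-parameter model sampled 8 times vs an 8B-parameter model sampled once.
small_cost = inference_cost(1_000_000_000, 100, 8)
large_cost = inference_cost(8_000_000_000, 100, 1)
same_budget = small_cost == large_cost
```

Under these assumptions the two configurations cost the same at inference time, so which one is preferable depends on how per-sample success probability changes with model size and training, which is the question the scaling law addresses.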

🟢 Xpertbench: Expert Level Tasks with Rubrics-Based Evaluation | score 5 | Sources: huggingface

As Large Language Models (LLMs) exhibit plateauing performance on conventional benchmarks, a pivotal challenge persists: evaluating their proficiency in complex, open-ended tasks characterizing genuine expert-level cognition. Existing frameworks suffer from narrow domain coverage, reliance on genera

📄 New Papers

| Title | Category | Score | Link |
| --- | --- | --- | --- |
| GrandCode: Achieving Grandmaster Level in Competitive Programming via Agentic Reinforcement Learning | model_release | 378 | Open |
| InCoder-32B-Thinking: Industrial Code World Model for Thinking | developer_tool | 236 | Open |
| Self-Distilled RLVR | developer_tool | 174 | Open |
| A Simple Baseline for Streaming Video Understanding | developer_tool | 80 | Open |
| Agentic-MME: What Agentic Capability Really Brings to Multimodal Intelligence? | model_release | 40 | Open |
| RESCORE: LLM-Driven Simulation Recovery in Control Systems Research Papers | cs.AI | 0 | Open |
| Soft Tournament Equilibrium | cs.AI | 0 | Open |
| GA-GS: Generation-Assisted Gaussian Splatting for Static Scene Reconstruction | cs.AI | 0 | Open |
| Boosted Distributional Reinforcement Learning: Analysis and Healthcare Applications | cs.AI | 0 | Open |
| Thermodynamic-Inspired Explainable GeoAI: Uncovering Regime-Dependent Mechanisms in Heterogeneous Spatial Systems | cs.AI | 0 | Open |
| Implementing surrogate goals for safer bargaining in LLM-based agents | cs.AI | 0 | Open |
| Domain-Contextualized Inference: A Computable Graph Architecture for Explicit-Domain Reasoning | cs.AI | 0 | Open |
| RoboPhD: Evolving Diverse Complex Agents Under Tight Evaluation Budgets | cs.AI | 0 | Open |
| ReFinE: Streamlining UI Mockup Iteration with Research Findings | cs.AI | 0 | Open |
| REAM: Merging Improves Pruning of Experts in LLMs | cs.AI | 0 | Open |

🏢 Lab Blog Posts