🔴 High Significance
Model Releases
🔴 GrandCode: Achieving Grandmaster Level in Competitive Programming via Agentic Reinforcement Learning - score 95
Sources: huggingface
Competitive programming remains one of the last few human strongholds in coding against AI. The best AI system to date still underperforms the best humans in competitive programming: the most recent best result, Google's Gemini 3 Deep Think, attained 8th place even though it was not evaluated under live compet
Developer Tools
🔴 InCoder-32B-Thinking: Industrial Code World Model for Thinking - score 85
Sources: huggingface
Industrial software development across chip design, GPU optimization, and embedded systems lacks expert reasoning traces showing how engineers reason about hardware constraints and timing semantics. In this work, we propose InCoder-32B-Thinking, trained on the data from the Error-driven Chain-of-Tho
🔴 Self-Distilled RLVR - score 75
Sources: huggingface
On-policy distillation (OPD) has become a popular training paradigm in the LLM community. This paradigm selects a larger model as the teacher to provide dense, fine-grained signals for each sampled trajectory, in contrast to reinforcement learning with verifiable rewards (RLVR), which only obtains s
🟡 Notable
Model Releases
🟡 Agentic-MME: What Agentic Capability Really Brings to Multimodal Intelligence? - score 55
Sources: huggingface
Multimodal Large Language Models (MLLMs) are evolving from passive observers into active agents, solving problems through Visual Expansion (invoking visual tools) and Knowledge Expansion (open-web search). However, existing evaluations fall short: they lack flexible tool integration, test visual and
Developer Tools
🟡 A Simple Baseline for Streaming Video Understanding - score 65
Sources: huggingface
Recent streaming video understanding methods increasingly rely on complex memory mechanisms to handle long video streams. We challenge this trend with a simple finding: a sliding-window baseline that feeds only the most recent N frames to an off-the-shelf VLM already matches or surpasses published s
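The sliding-window idea described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: `sliding_window_answers` and `answer_fn` are hypothetical names, and `answer_fn` stands in for whatever off-the-shelf VLM call is used.

```python
from collections import deque
from typing import Callable, Iterable, Iterator, List, TypeVar

Frame = TypeVar("Frame")
Answer = TypeVar("Answer")


def sliding_window_answers(
    frames: Iterable[Frame],
    window_size: int,
    answer_fn: Callable[[List[Frame]], Answer],
) -> Iterator[Answer]:
    """Yield one answer per incoming frame, using only the last N frames.

    `answer_fn` is a hypothetical placeholder for a VLM query; it receives
    the current window (oldest frame first) and returns any result.
    """
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    # deque(maxlen=N) silently drops the oldest frame once N is exceeded,
    # so no explicit memory mechanism is needed.
    window: deque = deque(maxlen=window_size)
    for frame in frames:
        window.append(frame)
        yield answer_fn(list(window))


if __name__ == "__main__":
    # Integers stand in for video frames; the "model" just echoes its input.
    for visible in sliding_window_answers(range(5), 3, lambda w: w):
        print(visible)
```

The bounded deque keeps memory use constant regardless of stream length, which is the property that makes this kind of baseline simple.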
🟡 Industrial policy for the Intelligence Age - score 50
Sources: lab_blog/OpenAI
Explore our ambitious, people-first industrial policy ideas for the AI era, focused on expanding opportunity, sharing prosperity, and building resilient institutions as advanced intelligence evolves.
Other Signals
🟡 Announcing the OpenAI Safety Fellowship - score 50
Sources: lab_blog/OpenAI
A pilot program to support independent safety and alignment research and develop the next generation of talent
🟡 Token Warping Helps MLLMs Look from Nearby Viewpoints - score 45
Sources: huggingface
Can warping tokens, rather than pixels, help multimodal large language models (MLLMs) understand how a scene appears from a nearby viewpoint? While MLLMs perform well on visual reasoning, they remain fragile to viewpoint changes, as pixel-wise warping is highly sensitive to small depth errors and of
🟢 Incremental
Model Releases
🟢 Communicating about Space: Language-Mediated Spatial Integration Across Partial Views - score 25
Sources: huggingface
Humans build shared spatial understanding by communicating partial, viewpoint-dependent observations. We ask whether Multimodal Large Language Models (MLLMs) can do the same, aligning distinct egocentric views through dialogue to form a coherent, allocentric mental model of a shared environment. To
🟢 AgentHazard: A Benchmark for Evaluating Harmful Behavior in Computer-Use Agents - score 15
Sources: huggingface
Computer-use agents extend language models from text generation to persistent action over tools, files, and execution environments. Unlike chat systems, they maintain state across interactions and translate intermediate outputs into concrete actions. This creates a distinct safety challenge in that
Developer Tools
🟢 Test-Time Scaling Makes Overtraining Compute-Optimal - score 35
Sources: huggingface
Modern LLMs scale at test-time, e.g. via repeated sampling, where inference cost grows with model size and the number of samples. This creates a trade-off that pretraining scaling laws, such as Chinchilla, do not address. We present Train-to-Test (T^2) scaling laws that jointly optimize model size,
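The trade-off between model size and sample count can be made concrete with a rough cost model. The sketch below is not the paper's T^2 law. It assumes the common approximation of about 2 FLOPs per parameter per generated token, and it assumes samples are independent with a fixed per-sample success rate; both function names are hypothetical.

```python
def inference_flops(params: float, tokens_per_sample: float, num_samples: int) -> float:
    """Approximate forward-pass cost of repeated sampling.

    Assumption: ~2 FLOPs per parameter per generated token, so cost grows
    linearly in both model size and the number of samples.
    """
    return 2.0 * params * tokens_per_sample * num_samples


def success_with_k_samples(per_sample_success: float, k: int) -> float:
    """Chance that at least one of k independent samples succeeds.

    Assumption: samples are independent with the same success probability.
    """
    return 1.0 - (1.0 - per_sample_success) ** k


if __name__ == "__main__":
    # Illustrative numbers only: a 1B-parameter model sampled 4 times costs
    # the same under this approximation as a 4B-parameter model sampled once.
    print(inference_flops(1e9, 100, 4))
    print(inference_flops(4e9, 100, 1))
    print(success_with_k_samples(0.5, 2))
```

Under these assumptions a smaller model with more samples and a larger model with fewer samples can land on the same inference budget, which is the kind of trade-off that pretraining-only scaling laws leave out.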
🟢 Xpertbench: Expert Level Tasks with Rubrics-Based Evaluation - score 5
Sources: huggingface
As Large Language Models (LLMs) exhibit plateauing performance on conventional benchmarks, a pivotal challenge persists: evaluating their proficiency in complex, open-ended tasks characterizing genuine expert-level cognition. Existing frameworks suffer from narrow domain coverage, reliance on genera
New Papers
| Title | Category | Score | Link |
|---|---|---|---|
| GrandCode: Achieving Grandmaster Level in Competitive Programming via Agentic Reinforcement Learning | model_release | 378 | Open |
| InCoder-32B-Thinking: Industrial Code World Model for Thinking | developer_tool | 236 | Open |
| Self-Distilled RLVR | developer_tool | 174 | Open |
| A Simple Baseline for Streaming Video Understanding | developer_tool | 80 | Open |
| Agentic-MME: What Agentic Capability Really Brings to Multimodal Intelligence? | model_release | 40 | Open |
| RESCORE: LLM-Driven Simulation Recovery in Control Systems Research Papers | cs.AI | 0 | Open |
| Soft Tournament Equilibrium | cs.AI | 0 | Open |
| GA-GS: Generation-Assisted Gaussian Splatting for Static Scene Reconstruction | cs.AI | 0 | Open |
| Boosted Distributional Reinforcement Learning: Analysis and Healthcare Applications | cs.AI | 0 | Open |
| Thermodynamic-Inspired Explainable GeoAI: Uncovering Regime-Dependent Mechanisms in Heterogeneous Spatial Systems | cs.AI | 0 | Open |
| Implementing surrogate goals for safer bargaining in LLM-based agents | cs.AI | 0 | Open |
| Domain-Contextualized Inference: A Computable Graph Architecture for Explicit-Domain Reasoning | cs.AI | 0 | Open |
| RoboPhD: Evolving Diverse Complex Agents Under Tight Evaluation Budgets | cs.AI | 0 | Open |
| ReFinE: Streamlining UI Mockup Iteration with Research Findings | cs.AI | 0 | Open |
| REAM: Merging Improves Pruning of Experts in LLMs | cs.AI | 0 | Open |