🔴 High Significance

Developer Tools

🔴 ClawGUI: A Unified Framework for Training, Evaluating, and Deploying GUI Agents - score 95 (Sources: huggingface)

GUI agents drive applications through their visual interfaces instead of programmatic APIs, interacting with arbitrary software via taps, swipes, and keystrokes, reaching a long tail of applications that CLI-based agents cannot. Yet progress in this area is bottlenecked less by modeling capacity tha

🔴 KnowRL: Boosting LLM Reasoning via Reinforcement Learning with Minimal-Sufficient Knowledge Guidance - score 85 (Sources: huggingface)

RLVR improves reasoning in large language models, but its effectiveness is often limited by severe reward sparsity on hard problems. Recent hint-based RL methods mitigate sparsity by injecting partial solutions or abstract templates, yet they typically scale guidance by adding more tokens, which int

🔴 Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe - score 75 (Sources: huggingface)

On-policy distillation (OPD) has become a core technique in the post-training of large language models, yet its training dynamics remain poorly understood. This paper provides a systematic investigation of OPD dynamics and mechanisms. We first identify that two conditions govern whether OPD succeeds

🟡 Notable

Model Releases

🟡 Nemotron 3 Super: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning - score 55 (Sources: huggingface)

We describe the pre-training, post-training, and quantization of Nemotron 3 Super, a 120 billion (active 12 billion) parameter hybrid Mamba-Attention Mixture-of-Experts model. Nemotron 3 Super is the first model in the Nemotron 3 family to 1) be pre-trained in NVFP4, 2) leverage LatentMoE, a new Mix

🟡 Gemini 3.1 Flash TTS: the next generation of expressive AI speech - score 50 (Sources: lab_blog/DeepMind)

Our newest audio model introduces granular audio tags that give you precise control to direct AI speech for expressive audio generation.

Developer Tools

🟡 Lyra 2.0: Explorable Generative 3D Worlds - score 65 (Sources: huggingface)

Recent advances in video generation enable a new paradigm for 3D scene creation: generating camera-controlled videos that simulate scene walkthroughs, then lifting them to 3D via feed-forward reconstruction techniques. This generative reconstruction approach combines the visual fidelity and creative

🟡 The next evolution of the Agents SDK - score 50 (Sources: lab_blog/OpenAI)

OpenAI updates the Agents SDK with native sandbox execution and a model-native harness, helping developers build secure, long-running agents across files and tools.

🟡 Toward Autonomous Long-Horizon Engineering for ML Research - score 45 (Sources: huggingface)

Autonomous AI research has advanced rapidly, but long-horizon ML research engineering remains difficult: agents must sustain coherent progress across task comprehension, environment setup, implementation, experimentation, and debugging over hours or days. We introduce AiScientist, a system for auton

🟢 Incremental

Model Releases

🟢 BERT-as-a-Judge: A Robust Alternative to Lexical Methods for Efficient Reference-Based LLM Evaluation - score 25 (Sources: huggingface)

Accurate evaluation is central to the large language model (LLM) ecosystem, guiding model selection and downstream adoption across diverse use cases. In practice, however, evaluating generative outputs typically relies on rigid lexical methods to extract and assess answers, which can conflate a mode

🟢 The Blind Spot of Agent Safety: How Benign User Instructions Expose Critical Vulnerabilities in Computer-Use Agents - score 5 (Sources: huggingface)

Computer-use agents (CUAs) can now autonomously complete complex tasks in real digital environments, but when misled, they can also be used to automate harmful actions programmatically. Existing safety evaluations largely target explicit threats such as misuse and prompt injection, but overlook a su

Developer Tools

🟢 Turing Test on Screen: A Benchmark for Mobile GUI Agent Humanization - score 25 (Sources: huggingface)

The rise of autonomous GUI agents has triggered adversarial countermeasures from digital platforms, yet existing research prioritizes utility and robustness over the critical dimension of anti-detection. We argue that for agents to survive in human-centric ecosystems, they must evolve Humanization c

🟢 SPPO: Sequence-Level PPO for Long-Horizon Reasoning Tasks - score 25 (Sources: huggingface)

Proximal Policy Optimization (PPO) is central to aligning Large Language Models (LLMs) in reasoning tasks with verifiable rewards. However, standard token-level PPO struggles in this setting due to the instability of temporal credit assignment over long Chain-of-Thought (CoT) horizons and the prohib

📄 New Papers

Title | Category | Score | Link
ClawGUI: A Unified Framework for Training, Evaluating, and Deploying GUI Agents | developer_tool | 149 | Open
KnowRL: Boosting LLM Reasoning via Reinforcement Learning with Minimal-Sufficient Knowledge Guidance | developer_tool | 103 | Open
Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe | developer_tool | 94 | Open
Lyra 2.0: Explorable Generative 3D Worlds | developer_tool | 43 | Open
Nemotron 3 Super: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning | model_release | 37 | Open
A 3D SAM-Based Progressive Prompting Framework for Multi-Task Segmentation of Radiotherapy-induced Normal Tissue Injuries in Limited-Data Settings | cs.AI | 0 | Open
Young people's perceptions and recommendations for conversational generative artificial intelligence in youth mental health | cs.AI | 0 | Open
On the Use of Evolutionary Optimization for the Dynamic Chance Constrained Open-Pit Mine Scheduling Problem | cs.AI | 0 | Open
ReSS: Learning Reasoning Models for Tabular Data Prediction via Symbolic Scaffold | cs.AI | 0 | Open
Quantifying and Understanding Uncertainty in Large Reasoning Models | cs.AI | 0 | Open
From Prediction to Justification: Aligning Sentiment Reasoning with Human Rationale via Reinforcement Learning | cs.AI | 0 | Open
Minimax Optimality and Spectral Routing for Majority-Vote Ensembles under Markov Dependence | cs.AI | 0 | Open
DF3DV-1K: A Large-Scale Dataset and Benchmark for Distractor-Free Novel View Synthesis | cs.AI | 0 | Open
The Cognitive Circuit Breaker: A Systems Engineering Framework for Intrinsic AI Reliability | cs.AI | 0 | Open
MERRIN: A Benchmark for Multimodal Evidence Retrieval and Reasoning in Noisy Web Environments | cs.AI | 0 | Open

๐Ÿข Lab Blog Posts