🔴 High Significance
Developer Tools
🔴 ClawGUI: A Unified Framework for Training, Evaluating, and Deploying GUI Agents - score 95
Sources: huggingface
GUI agents drive applications through their visual interfaces instead of programmatic APIs, interacting with arbitrary software via taps, swipes, and keystrokes, reaching a long tail of applications that CLI-based agents cannot. Yet progress in this area is bottlenecked less by modeling capacity tha
🔴 KnowRL: Boosting LLM Reasoning via Reinforcement Learning with Minimal-Sufficient Knowledge Guidance - score 85
Sources: huggingface
RLVR improves reasoning in large language models, but its effectiveness is often limited by severe reward sparsity on hard problems. Recent hint-based RL methods mitigate sparsity by injecting partial solutions or abstract templates, yet they typically scale guidance by adding more tokens, which int
🔴 Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe - score 75
Sources: huggingface
On-policy distillation (OPD) has become a core technique in the post-training of large language models, yet its training dynamics remain poorly understood. This paper provides a systematic investigation of OPD dynamics and mechanisms. We first identify that two conditions govern whether OPD succeeds
🟡 Notable
Model Releases
🟡 Nemotron 3 Super: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning - score 55
Sources: huggingface
We describe the pre-training, post-training, and quantization of Nemotron 3 Super, a 120 billion (active 12 billion) parameter hybrid Mamba-Attention Mixture-of-Experts model. Nemotron 3 Super is the first model in the Nemotron 3 family to 1) be pre-trained in NVFP4, 2) leverage LatentMoE, a new Mix
🟡 Gemini 3.1 Flash TTS: the next generation of expressive AI speech - score 50
Sources: lab_blog/DeepMind
Our newest audio model introduces granular audio tags that give you precise control to direct AI speech for expressive audio generation.
Developer Tools
🟡 Lyra 2.0: Explorable Generative 3D Worlds - score 65
Sources: huggingface
Recent advances in video generation enable a new paradigm for 3D scene creation: generating camera-controlled videos that simulate scene walkthroughs, then lifting them to 3D via feed-forward reconstruction techniques. This generative reconstruction approach combines the visual fidelity and creative
🟡 The next evolution of the Agents SDK - score 50
Sources: lab_blog/OpenAI
OpenAI updates the Agents SDK with native sandbox execution and a model-native harness, helping developers build secure, long-running agents across files and tools.
🟡 Toward Autonomous Long-Horizon Engineering for ML Research - score 45
Sources: huggingface
Autonomous AI research has advanced rapidly, but long-horizon ML research engineering remains difficult: agents must sustain coherent progress across task comprehension, environment setup, implementation, experimentation, and debugging over hours or days. We introduce AiScientist, a system for auton
🟢 Incremental
Model Releases
🟢 BERT-as-a-Judge: A Robust Alternative to Lexical Methods for Efficient Reference-Based LLM Evaluation - score 25
Sources: huggingface
Accurate evaluation is central to the large language model (LLM) ecosystem, guiding model selection and downstream adoption across diverse use cases. In practice, however, evaluating generative outputs typically relies on rigid lexical methods to extract and assess answers, which can conflate a mode
🟢 The Blind Spot of Agent Safety: How Benign User Instructions Expose Critical Vulnerabilities in Computer-Use Agents - score 5
Sources: huggingface
Computer-use agents (CUAs) can now autonomously complete complex tasks in real digital environments, but when misled, they can also be used to automate harmful actions programmatically. Existing safety evaluations largely target explicit threats such as misuse and prompt injection, but overlook a su
Developer Tools
🟢 Turing Test on Screen: A Benchmark for Mobile GUI Agent Humanization - score 25
Sources: huggingface
The rise of autonomous GUI agents has triggered adversarial countermeasures from digital platforms, yet existing research prioritizes utility and robustness over the critical dimension of anti-detection. We argue that for agents to survive in human-centric ecosystems, they must evolve Humanization c
🟢 SPPO: Sequence-Level PPO for Long-Horizon Reasoning Tasks - score 25
Sources: huggingface
Proximal Policy Optimization (PPO) is central to aligning Large Language Models (LLMs) in reasoning tasks with verifiable rewards. However, standard token-level PPO struggles in this setting due to the instability of temporal credit assignment over long Chain-of-Thought (CoT) horizons and the prohib
New Papers
| Title | Category | Score | Link |
|---|---|---|---|
| ClawGUI: A Unified Framework for Training, Evaluating, and Deploying GUI Agents | developer_tool | 149 | Open |
| KnowRL: Boosting LLM Reasoning via Reinforcement Learning with Minimal-Sufficient Knowledge Guidance | developer_tool | 103 | Open |
| Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe | developer_tool | 94 | Open |
| Lyra 2.0: Explorable Generative 3D Worlds | developer_tool | 43 | Open |
| Nemotron 3 Super: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning | model_release | 37 | Open |
| A 3D SAM-Based Progressive Prompting Framework for Multi-Task Segmentation of Radiotherapy-induced Normal Tissue Injuries in Limited-Data Settings | cs.AI | 0 | Open |
| Young people's perceptions and recommendations for conversational generative artificial intelligence in youth mental health | cs.AI | 0 | Open |
| On the Use of Evolutionary Optimization for the Dynamic Chance Constrained Open-Pit Mine Scheduling Problem | cs.AI | 0 | Open |
| ReSS: Learning Reasoning Models for Tabular Data Prediction via Symbolic Scaffold | cs.AI | 0 | Open |
| Quantifying and Understanding Uncertainty in Large Reasoning Models | cs.AI | 0 | Open |
| From Prediction to Justification: Aligning Sentiment Reasoning with Human Rationale via Reinforcement Learning | cs.AI | 0 | Open |
| Minimax Optimality and Spectral Routing for Majority-Vote Ensembles under Markov Dependence | cs.AI | 0 | Open |
| DF3DV-1K: A Large-Scale Dataset and Benchmark for Distractor-Free Novel View Synthesis | cs.AI | 0 | Open |
| The Cognitive Circuit Breaker: A Systems Engineering Framework for Intrinsic AI Reliability | cs.AI | 0 | Open |
| MERRIN: A Benchmark for Multimodal Evidence Retrieval and Reasoning in Noisy Web Environments | cs.AI | 0 | Open |