🔴 High Significance
Model Releases
🔴 ClawKeeper: Comprehensive Safety Protection for OpenClaw Agents Through Skills, Plugins, and Watchers - score 95
Sources: huggingface
OpenClaw has rapidly established itself as a leading open-source autonomous agent runtime, offering powerful capabilities including tool integration, local file access, and shell command execution. However, these broad operational privileges introduce critical security vulnerabilities, transforming
🔴 Terminal Agents Suffice for Enterprise Automation - score 85
Sources: huggingface
There has been growing interest in building agents that can interact with digital platforms to execute meaningful enterprise tasks autonomously. Among the approaches explored are tool-augmented agents built on abstractions such as Model Context Protocol (MCP) and web agents that operate through grap
Developer Tools
🔴 MiroEval: Benchmarking Multimodal Deep Research Agents in Process and Outcome - score 75
Sources: huggingface
Recent progress in deep research systems has been impressive, but evaluation still lags behind real user needs. Existing benchmarks predominantly assess final reports using fixed rubrics, failing to evaluate the underlying research process. Most also offer limited multimodal coverage, rely on synthe
🟡 Notable
Model Releases
🟡 Embarrassingly Simple Self-Distillation Improves Code Generation - score 65
Sources: huggingface
Can a large language model (LLM) improve at code generation using only its own raw outputs, without a verifier, a teacher model, or reinforcement learning? We answer in the affirmative with simple self-distillation (SSD): sample solutions from the model with certain temperature and truncation config
🟡 Codex now offers more flexible pricing for teams - score 50
Sources: lab_blog/OpenAI
Codex now includes pay-as-you-go pricing for ChatGPT Business and Enterprise, providing teams a more flexible option to start and scale adoption.
Developer Tools
🟡 ViGoR-Bench: How Far Are Visual Generative Models From Zero-Shot Visual Reasoners? - score 55
Sources: huggingface
Beneath the stunning visual fidelity of modern AIGC models lies a "logical desert", where systems fail tasks that require physical, causal, or complex spatial reasoning. Current evaluations largely rely on superficial metrics or fragmented benchmarks, creating a "performance mirage" that overlooks
🟡 Gemma 4: Byte for byte, the most capable open models - score 50
Sources: lab_blog/DeepMind
Gemma 4: Our most intelligent open models to date, purpose-built for advanced reasoning and agentic workflows.
🟡 Vision2Web: A Hierarchical Benchmark for Visual Website Development with Agent Verification - score 45
Sources: huggingface
Recent advances in large language models have improved the capabilities of coding agents, yet systematic evaluation of complex, end-to-end website development remains limited. To address this gap, we introduce Vision2Web, a hierarchical benchmark for visual website development, spanning from static
Other Signals
🟡 OpenAI acquires TBPN - score 50
Sources: lab_blog/OpenAI
OpenAI acquires TBPN to accelerate global conversations around AI and support independent media, expanding dialogue with builders, businesses, and the broader tech community.
🟢 Incremental
Model Releases
🟢 QuitoBench: A High-Quality Open Time Series Forecasting Benchmark - score 25
Sources: huggingface
Time series forecasting is critical across finance, healthcare, and cloud computing, yet progress is constrained by a fundamental bottleneck: the scarcity of large-scale, high-quality benchmarks. To address this gap, we introduce QuitoBench, a regime-balanced benchmark for time series forecasting wi
🟢 GaussianGPT: Towards Autoregressive 3D Gaussian Scene Generation - score 5
Sources: huggingface
Most recent advances in 3D generative modeling rely on diffusion or flow-matching formulations. We instead explore a fully autoregressive alternative and introduce GaussianGPT, a transformer-based model that directly generates 3D Gaussians via next-token prediction, thus facilitating full 3D scene g
Developer Tools
🟢 Reasoning Shift: How Context Silently Shortens LLM Reasoning - score 35
Sources: huggingface
Large language models (LLMs) exhibiting test-time scaling behavior, such as extended reasoning traces and self-verification, have demonstrated remarkable performance on complex, long-term reasoning tasks. However, the robustness of these reasoning behaviors remains underexplored. To investigate this
🟢 HippoCamp: Benchmarking Contextual Agents on Personal Computers - score 15
Sources: huggingface
We present HippoCamp, a new benchmark designed to evaluate agents' capabilities on multimodal file management. Unlike existing agent benchmarks that focus on tasks like web interaction, tool use, or software automation in generic settings, HippoCamp evaluates agents in user-centric environments to m
New Papers
| Title | Category | Score |
|---|---|---|
| ClawKeeper: Comprehensive Safety Protection for OpenClaw Agents Through Skills, Plugins, and Watchers | model_release | 187 |
| Terminal Agents Suffice for Enterprise Automation | model_release | 103 |
| MiroEval: Benchmarking Multimodal Deep Research Agents in Process and Outcome | developer_tool | 75 |
| Embarrassingly Simple Self-Distillation Improves Code Generation | model_release | 54 |
| ViGoR-Bench: How Far Are Visual Generative Models From Zero-Shot Visual Reasoners? | developer_tool | 46 |
| Magic, Madness, Heaven, Sin: LLM Output Diversity is Everything, Everywhere, All at Once | cs.AI | 0 |
| ToolMisuseBench: An Offline Deterministic Benchmark for Tool Misuse and Recovery in Agentic Systems | cs.AI | 0 |
| LLM Agents as Social Scientists: A Human-AI Collaborative Platform for Social Science Automation | cs.AI | 0 |
| ProdCodeBench: A Production-Derived Benchmark for Evaluating AI Coding Agents | cs.AI | 0 |
| A Role-Based LLM Framework for Structured Information Extraction from Healthy Food Policies | cs.AI | 0 |
| PHMForge: A Scenario-Driven Agentic Benchmark for Industrial Asset Lifecycle Maintenance | cs.AI | 0 |
| Countering Catastrophic Forgetting of Large Language Models for Better Instruction Following via Weight-Space Model Merging | cs.AI | 0 |
| What Is Actually Being Annotated? Inter-Prompt Reliability as a Measurement Problem in LLM-Based Social Science Labeling | cs.AI | 0 |
| RAE-AR: Taming Autoregressive Models with Representation Autoencoders | cs.AI | 0 |
| Automating Database-Native Function Code Synthesis with LLMs | cs.AI | 0 |