🔴 High Significance

Model Releases

🔴 ClawKeeper: Comprehensive Safety Protection for OpenClaw Agents Through Skills, Plugins, and Watchers - score 95 | Sources: huggingface

OpenClaw has rapidly established itself as a leading open-source autonomous agent runtime, offering powerful capabilities including tool integration, local file access, and shell command execution. However, these broad operational privileges introduce critical security vulnerabilities, transforming

🔴 Terminal Agents Suffice for Enterprise Automation - score 85 | Sources: huggingface

There has been growing interest in building agents that can interact with digital platforms to execute meaningful enterprise tasks autonomously. Among the approaches explored are tool-augmented agents built on abstractions such as Model Context Protocol (MCP) and web agents that operate through grap

Developer Tools

🔴 MiroEval: Benchmarking Multimodal Deep Research Agents in Process and Outcome - score 75 | Sources: huggingface

Recent progress in deep research systems has been impressive, but evaluation still lags behind real user needs. Existing benchmarks predominantly assess final reports using fixed rubrics, failing to evaluate the underlying research process. Most also offer limited multimodal coverage, rely on synthe

🟡 Notable

Model Releases

🟡 Embarrassingly Simple Self-Distillation Improves Code Generation - score 65 | Sources: huggingface

Can a large language model (LLM) improve at code generation using only its own raw outputs, without a verifier, a teacher model, or reinforcement learning? We answer in the affirmative with simple self-distillation (SSD): sample solutions from the model with certain temperature and truncation config

🟡 Codex now offers more flexible pricing for teams - score 50 | Sources: lab_blog/OpenAI

Codex now includes pay-as-you-go pricing for ChatGPT Business and Enterprise, providing teams a more flexible option to start and scale adoption.

Developer Tools

🟡 ViGoR-Bench: How Far Are Visual Generative Models From Zero-Shot Visual Reasoners? - score 55 | Sources: huggingface

Beneath the stunning visual fidelity of modern AIGC models lies a "logical desert", where systems fail tasks that require physical, causal, or complex spatial reasoning. Current evaluations largely rely on superficial metrics or fragmented benchmarks, creating a "performance mirage" that overlooks

🟡 Gemma 4: Byte for byte, the most capable open models - score 50 | Sources: lab_blog/DeepMind

Gemma 4: Our most intelligent open models to date, purpose-built for advanced reasoning and agentic workflows.

🟡 Vision2Web: A Hierarchical Benchmark for Visual Website Development with Agent Verification - score 45 | Sources: huggingface

Recent advances in large language models have improved the capabilities of coding agents, yet systematic evaluation of complex, end-to-end website development remains limited. To address this gap, we introduce Vision2Web, a hierarchical benchmark for visual website development, spanning from static

Other Signals

🟡 OpenAI acquires TBPN - score 50 | Sources: lab_blog/OpenAI

OpenAI acquires TBPN to accelerate global conversations around AI and support independent media, expanding dialogue with builders, businesses, and the broader tech community.

🟢 Incremental

Model Releases

🟢 QuitoBench: A High-Quality Open Time Series Forecasting Benchmark - score 25 | Sources: huggingface

Time series forecasting is critical across finance, healthcare, and cloud computing, yet progress is constrained by a fundamental bottleneck: the scarcity of large-scale, high-quality benchmarks. To address this gap, we introduce QuitoBench, a regime-balanced benchmark for time series forecasting wi

🟢 GaussianGPT: Towards Autoregressive 3D Gaussian Scene Generation - score 5 | Sources: huggingface

Most recent advances in 3D generative modeling rely on diffusion or flow-matching formulations. We instead explore a fully autoregressive alternative and introduce GaussianGPT, a transformer-based model that directly generates 3D Gaussians via next-token prediction, thus facilitating full 3D scene g

Developer Tools

🟢 Reasoning Shift: How Context Silently Shortens LLM Reasoning - score 35 | Sources: huggingface

Large language models (LLMs) exhibiting test-time scaling behavior, such as extended reasoning traces and self-verification, have demonstrated remarkable performance on complex, long-term reasoning tasks. However, the robustness of these reasoning behaviors remains underexplored. To investigate this

🟢 HippoCamp: Benchmarking Contextual Agents on Personal Computers - score 15 | Sources: huggingface

We present HippoCamp, a new benchmark designed to evaluate agents' capabilities on multimodal file management. Unlike existing agent benchmarks that focus on tasks like web interaction, tool use, or software automation in generic settings, HippoCamp evaluates agents in user-centric environments to m

📄 New Papers

Title | Category | Score | Link
ClawKeeper: Comprehensive Safety Protection for OpenClaw Agents Through Skills, Plugins, and Watchers | model_release | 187 | Open
Terminal Agents Suffice for Enterprise Automation | model_release | 103 | Open
MiroEval: Benchmarking Multimodal Deep Research Agents in Process and Outcome | developer_tool | 75 | Open
Embarrassingly Simple Self-Distillation Improves Code Generation | model_release | 54 | Open
ViGoR-Bench: How Far Are Visual Generative Models From Zero-Shot Visual Reasoners? | developer_tool | 46 | Open
Magic, Madness, Heaven, Sin: LLM Output Diversity is Everything, Everywhere, All at Once | cs.AI | 0 | Open
ToolMisuseBench: An Offline Deterministic Benchmark for Tool Misuse and Recovery in Agentic Systems | cs.AI | 0 | Open
LLM Agents as Social Scientists: A Human-AI Collaborative Platform for Social Science Automation | cs.AI | 0 | Open
ProdCodeBench: A Production-Derived Benchmark for Evaluating AI Coding Agents | cs.AI | 0 | Open
A Role-Based LLM Framework for Structured Information Extraction from Healthy Food Policies | cs.AI | 0 | Open
PHMForge: A Scenario-Driven Agentic Benchmark for Industrial Asset Lifecycle Maintenance | cs.AI | 0 | Open
Countering Catastrophic Forgetting of Large Language Models for Better Instruction Following via Weight-Space Model Merging | cs.AI | 0 | Open
What Is Actually Being Annotated? Inter-Prompt Reliability as a Measurement Problem in LLM-Based Social Science Labeling | cs.AI | 0 | Open
RAE-AR: Taming Autoregressive Models with Representation Autoencoders | cs.AI | 0 | Open
Automating Database-Native Function Code Synthesis with LLMs | cs.AI | 0 | Open

🏢 Lab Blog Posts