🔴 High Significance

Developer Tools

🔴 GLM-5: from Vibe Coding to Agentic Engineering - score 95 · Sources: huggingface

We present GLM-5, a next-generation foundation model designed to transition the paradigm of vibe coding to agentic engineering. Building upon the agentic, reasoning, and coding (ARC) capabilities of its predecessor, GLM-5 adopts DSA to significantly reduce training and inference costs while maintain

🔴 SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks - score 85 · Sources: huggingface

Agent Skills are structured packages of procedural knowledge that augment LLM agents at inference time. Despite rapid adoption, there is no standard way to measure whether they actually help. We present SkillsBench, a benchmark of 86 tasks across 11 domains paired with curated Skills and determinist

Business & Funding

🔴 Sanity Checks for Sparse Autoencoders: Do SAEs Beat Random Baselines? - score 75 · Sources: huggingface

Sparse Autoencoders (SAEs) have emerged as a promising tool for interpreting neural networks by decomposing their activations into sparse sets of human-interpretable features. Recent work has introduced multiple SAE variants and successfully scaled them to frontier models. Despite much excitement, a

🟡 Notable

Model Releases

🟡 Introducing OpenAI for India - score 50 · Sources: lab_blog/OpenAI

OpenAI for India expands AI access across the country: building local infrastructure, powering enterprises, and advancing workforce skills.

🟡 Introducing EVMbench - score 50 · Sources: lab_blog/OpenAI

OpenAI and Paradigm introduce EVMbench, a benchmark evaluating AI agents’ ability to detect, patch, and exploit high-severity smart contract vulnerabilities.

🟡 A new way to express yourself: Gemini can now create music - score 50 · Sources: lab_blog/DeepMind

The Gemini app now features our most advanced music generation model Lyria 3, empowering anyone to make 30-second tracks using text or images.

🟡 A Trajectory-Based Safety Audit of Clawdbot (OpenClaw) - score 45 · Sources: huggingface

Clawdbot is a self-hosted, tool-using personal AI agent with a broad action space spanning local execution and web-mediated workflows, which raises heightened safety and security concerns under ambiguity and adversarial steering. We present a trajectory-centric evaluation of Clawdbot across six risk

Developer Tools

🟡 Does Socialization Emerge in AI Agent Society? A Case Study of Moltbook - score 65 · Sources: huggingface

As large language model agents increasingly populate networked environments, a fundamental question arises: do artificial intelligence (AI) agent societies undergo convergence dynamics similar to human social systems? Lately, Moltbook approximates a plausible future scenario in which autonomous agen

🟡 jina-embeddings-v5-text: Task-Targeted Embedding Distillation - score 55 · Sources: huggingface

Text embedding models are widely used for semantic similarity tasks, including information retrieval, clustering, and classification. General-purpose models are typically trained with single- or multi-stage processes using contrastive loss functions. We introduce a novel training regimen that combin

🟢 Incremental

Model Releases

🟢 ResearchGym: Evaluating Language Model Agents on Real-World AI Research - score 35 · Sources: huggingface

We introduce ResearchGym, a benchmark and execution environment for evaluating AI agents on end-to-end research. To instantiate this, we repurpose five oral and spotlight papers from ICML, ICLR, and ACL. From each paper's repository, we preserve the datasets, evaluation harness, and baseline impleme

🟢 HLE-Verified: A Systematic Verification and Structured Revision of Humanity's Last Exam - score 5 · Sources: huggingface

Humanity's Last Exam (HLE) has become a widely used benchmark for evaluating frontier large language models on challenging, multi-domain questions. However, community-led analyses have raised concerns that HLE contains a non-trivial number of noisy items, which can bias evaluation results and distor

Developer Tools

🟢 UniT: Unified Multimodal Chain-of-Thought Test-time Scaling - score 25 · Sources: huggingface

Unified models can handle both multimodal understanding and generation within a single architecture, yet they typically operate in a single pass without iteratively refining their outputs. Many multimodal tasks, especially those involving complex spatial compositions, multiple interacting objects, o

🟢 Revisiting the Platonic Representation Hypothesis: An Aristotelian View - score 15 · Sources: huggingface

The Platonic Representation Hypothesis suggests that representations from neural networks are converging to a common statistical model of reality. We show that the existing metrics used to measure representational similarity are confounded by network scale: increasing model depth or width can system

📄 New Papers

Title | Category | Score | Link
GLM-5: from Vibe Coding to Agentic Engineering | developer_tool | 155 | Open
SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks | developer_tool | 64 | Open
Sanity Checks for Sparse Autoencoders: Do SAEs Beat Random Baselines? | business_funding | 59 | Open
Does Socialization Emerge in AI Agent Society? A Case Study of Moltbook | developer_tool | 31 | Open
jina-embeddings-v5-text: Task-Targeted Embedding Distillation | developer_tool | 28 | Open
Measuring and Eliminating Refusals in Military Large Language Models | cs.AI | 0 | Open
GPSBench: Do Large Language Models Understand GPS Coordinates? | cs.AI | 0 | Open
Can Adversarial Code Comments Fool AI Security Reviewers -- Large-Scale Empirical Study of Comment-Based Attacks and Defenses Against LLM Code Analysis | cs.AI | 0 | Open
Federated Graph AGI for Cross-Border Insider Threat Intelligence in Government Financial Schemes | cs.AI | 0 | Open
OmniCT: Towards a Unified Slice-Volume LVLM for Comprehensive CT Analysis | cs.AI | 0 | Open
Surrogate-Based Prevalence Measurement for Large-Scale A/B Testing | cs.AI | 0 | Open
Rethinking ANN-based Retrieval: Multifaceted Learnable Index for Large-scale Recommendation System | cs.AI | 0 | Open
DeepVision-103K: A Visually Diverse, Broad-Coverage, and Verifiable Mathematical Dataset for Multimodal Reasoning | cs.AI | 0 | Open
Retrieval Collapses When AI Pollutes the Web | cs.AI | 0 | Open
Human-AI Collaboration in Large Language Model-Integrated Building Energy Management Systems: The Role of User Domain Knowledge and AI Literacy | cs.AI | 0 | Open

🏢 Lab Blog Posts