🔴 High Significance
Developer Tools
🔴 GLM-5: from Vibe Coding to Agentic Engineering - score 95
Sources: huggingface
We present GLM-5, a next-generation foundation model designed to transition the paradigm of vibe coding to agentic engineering. Building upon the agentic, reasoning, and coding (ARC) capabilities of its predecessor, GLM-5 adopts DSA to significantly reduce training and inference costs while maintain
🔴 SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks - score 85
Sources: huggingface
Agent Skills are structured packages of procedural knowledge that augment LLM agents at inference time. Despite rapid adoption, there is no standard way to measure whether they actually help. We present SkillsBench, a benchmark of 86 tasks across 11 domains paired with curated Skills and determinist
Business & Funding
🔴 Sanity Checks for Sparse Autoencoders: Do SAEs Beat Random Baselines? - score 75
Sources: huggingface
Sparse Autoencoders (SAEs) have emerged as a promising tool for interpreting neural networks by decomposing their activations into sparse sets of human-interpretable features. Recent work has introduced multiple SAE variants and successfully scaled them to frontier models. Despite much excitement, a
🟡 Notable
Model Releases
🟡 Introducing OpenAI for India - score 50
Sources: lab_blog/OpenAI
OpenAI for India expands AI access across the country: building local infrastructure, powering enterprises, and advancing workforce skills.
🟡 Introducing EVMbench - score 50
Sources: lab_blog/OpenAI
OpenAI and Paradigm introduce EVMbench, a benchmark evaluating AI agents' ability to detect, patch, and exploit high-severity smart contract vulnerabilities.
🟡 A new way to express yourself: Gemini can now create music - score 50
Sources: lab_blog/DeepMind
The Gemini app now features our most advanced music generation model Lyria 3, empowering anyone to make 30-second tracks using text or images.
🟡 A Trajectory-Based Safety Audit of Clawdbot (OpenClaw) - score 45
Sources: huggingface
Clawdbot is a self-hosted, tool-using personal AI agent with a broad action space spanning local execution and web-mediated workflows, which raises heightened safety and security concerns under ambiguity and adversarial steering. We present a trajectory-centric evaluation of Clawdbot across six risk
Developer Tools
🟡 Does Socialization Emerge in AI Agent Society? A Case Study of Moltbook - score 65
Sources: huggingface
As large language model agents increasingly populate networked environments, a fundamental question arises: do artificial intelligence (AI) agent societies undergo convergence dynamics similar to human social systems? Lately, Moltbook approximates a plausible future scenario in which autonomous agen
🟡 jina-embeddings-v5-text: Task-Targeted Embedding Distillation - score 55
Sources: huggingface
Text embedding models are widely used for semantic similarity tasks, including information retrieval, clustering, and classification. General-purpose models are typically trained with single- or multi-stage processes using contrastive loss functions. We introduce a novel training regimen that combin
🟢 Incremental
Model Releases
🟢 ResearchGym: Evaluating Language Model Agents on Real-World AI Research - score 35
Sources: huggingface
We introduce ResearchGym, a benchmark and execution environment for evaluating AI agents on end-to-end research. To instantiate this, we repurpose five oral and spotlight papers from ICML, ICLR, and ACL. From each paper's repository, we preserve the datasets, evaluation harness, and baseline impleme
🟢 HLE-Verified: A Systematic Verification and Structured Revision of Humanity's Last Exam - score 5
Sources: huggingface
Humanity's Last Exam (HLE) has become a widely used benchmark for evaluating frontier large language models on challenging, multi-domain questions. However, community-led analyses have raised concerns that HLE contains a non-trivial number of noisy items, which can bias evaluation results and distor
Developer Tools
🟢 UniT: Unified Multimodal Chain-of-Thought Test-time Scaling - score 25
Sources: huggingface
Unified models can handle both multimodal understanding and generation within a single architecture, yet they typically operate in a single pass without iteratively refining their outputs. Many multimodal tasks, especially those involving complex spatial compositions, multiple interacting objects, o
🟢 Revisiting the Platonic Representation Hypothesis: An Aristotelian View - score 15
Sources: huggingface
The Platonic Representation Hypothesis suggests that representations from neural networks are converging to a common statistical model of reality. We show that the existing metrics used to measure representational similarity are confounded by network scale: increasing model depth or width can system
New Papers
| Title | Category | Score |
|---|---|---|
| GLM-5: from Vibe Coding to Agentic Engineering | developer_tool | 155 |
| SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks | developer_tool | 64 |
| Sanity Checks for Sparse Autoencoders: Do SAEs Beat Random Baselines? | business_funding | 59 |
| Does Socialization Emerge in AI Agent Society? A Case Study of Moltbook | developer_tool | 31 |
| jina-embeddings-v5-text: Task-Targeted Embedding Distillation | developer_tool | 28 |
| Measuring and Eliminating Refusals in Military Large Language Models | cs.AI | 0 |
| GPSBench: Do Large Language Models Understand GPS Coordinates? | cs.AI | 0 |
| Can Adversarial Code Comments Fool AI Security Reviewers -- Large-Scale Empirical Study of Comment-Based Attacks and Defenses Against LLM Code Analysis | cs.AI | 0 |
| Federated Graph AGI for Cross-Border Insider Threat Intelligence in Government Financial Schemes | cs.AI | 0 |
| OmniCT: Towards a Unified Slice-Volume LVLM for Comprehensive CT Analysis | cs.AI | 0 |
| Surrogate-Based Prevalence Measurement for Large-Scale A/B Testing | cs.AI | 0 |
| Rethinking ANN-based Retrieval: Multifaceted Learnable Index for Large-scale Recommendation System | cs.AI | 0 |
| DeepVision-103K: A Visually Diverse, Broad-Coverage, and Verifiable Mathematical Dataset for Multimodal Reasoning | cs.AI | 0 |
| Retrieval Collapses When AI Pollutes the Web | cs.AI | 0 |
| Human-AI Collaboration in Large Language Model-Integrated Building Energy Management Systems: The Role of User Domain Knowledge and AI Literacy | cs.AI | 0 |
📢 Lab Blog Posts
- OpenAI: Introducing OpenAI for India
- OpenAI: Introducing EVMbench
- DeepMind: A new way to express yourself: Gemini can now create music