AI Watchtower

🔴 High Significance

Developer Tools

🔴 GLM-5: from Vibe Coding to Agentic Engineering — score 95 Sources: huggingface

We present GLM-5, a next-generation foundation model designed to transition the paradigm of vibe coding to agentic engineering. Building upon the agentic, reasoning, and coding (ARC) capabilities of its predecessor, GLM-5 adopts DSA to significantly reduce training and inference costs while maintain

🔴 SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks — score 85 Sources: huggingface

Agent Skills are structured packages of procedural knowledge that augment LLM agents at inference time. Despite rapid adoption, there is no standard way to measure whether they actually help. We present SkillsBench, a benchmark of 86 tasks across 11 domains paired with curated Skills and determinist

Business & Funding

🔴 Sanity Checks for Sparse Autoencoders: Do SAEs Beat Random Baselines? — score 75 Sources: huggingface

Sparse Autoencoders (SAEs) have emerged as a promising tool for interpreting neural networks by decomposing their activations into sparse sets of human-interpretable features. Recent work has introduced multiple SAE variants and successfully scaled them to frontier models. Despite much excitement, a

🟡 Notable

Model Releases

🟡 Introducing OpenAI for India — score 50 Sources: lab_blog/OpenAI

OpenAI for India expands AI access across the country—building local infrastructure, powering enterprises, and advancing workforce skills.

🟡 Introducing EVMbench — score 50 Sources: lab_blog/OpenAI

OpenAI and Paradigm introduce EVMbench, a benchmark evaluating AI agents’ ability to detect, patch, and exploit high-severity smart contract vulnerabilities.

🟡 A new way to express yourself: Gemini can now create music — score 50 Sources: lab_blog/DeepMind

The Gemini app now features our most advanced music generation model Lyria 3, empowering anyone to make 30-second tracks using text or images.

🟡 A Trajectory-Based Safety Audit of Clawdbot (OpenClaw) — score 45 Sources: huggingface

Clawdbot is a self-hosted, tool-using personal AI agent with a broad action space spanning local execution and web-mediated workflows, which raises heightened safety and security concerns under ambiguity and adversarial steering. We present a trajectory-centric evaluation of Clawdbot across six risk

Developer Tools

🟡 Does Socialization Emerge in AI Agent Society? A Case Study of Moltbook — score 65 Sources: huggingface

As large language model agents increasingly populate networked environments, a fundamental question arises: do artificial intelligence (AI) agent societies undergo convergence dynamics similar to human social systems? Lately, Moltbook approximates a plausible future scenario in which autonomous agen

🟡 jina-embeddings-v5-text: Task-Targeted Embedding Distillation — score 55 Sources: huggingface

Text embedding models are widely used for semantic similarity tasks, including information retrieval, clustering, and classification. General-purpose models are typically trained with single- or multi-stage processes using contrastive loss functions. We introduce a novel training regimen that combin

🟢 Incremental

Model Releases

🟢 ResearchGym: Evaluating Language Model Agents on Real-World AI Research — score 35 Sources: huggingface

We introduce ResearchGym, a benchmark and execution environment for evaluating AI agents on end-to-end research. To instantiate this, we repurpose five oral and spotlight papers from ICML, ICLR, and ACL. From each paper's repository, we preserve the datasets, evaluation harness, and baseline impleme

🟢 HLE-Verified: A Systematic Verification and Structured Revision of Humanity's Last Exam — score 5 Sources: huggingface

Humanity's Last Exam (HLE) has become a widely used benchmark for evaluating frontier large language models on challenging, multi-domain questions. However, community-led analyses have raised concerns that HLE contains a non-trivial number of noisy items, which can bias evaluation results and distor

Developer Tools

🟢 UniT: Unified Multimodal Chain-of-Thought Test-time Scaling — score 25 Sources: huggingface

Unified models can handle both multimodal understanding and generation within a single architecture, yet they typically operate in a single pass without iteratively refining their outputs. Many multimodal tasks, especially those involving complex spatial compositions, multiple interacting objects, o

🟢 Revisiting the Platonic Representation Hypothesis: An Aristotelian View — score 15 Sources: huggingface

The Platonic Representation Hypothesis suggests that representations from neural networks are converging to a common statistical model of reality. We show that the existing metrics used to measure representational similarity are confounded by network scale: increasing model depth or width can system

📄 New Papers

Title	Category	Score
GLM-5: from Vibe Coding to Agentic Engineering	developer_tool	155
SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks	developer_tool	64
Sanity Checks for Sparse Autoencoders: Do SAEs Beat Random Baselines?	business_funding	59
Does Socialization Emerge in AI Agent Society? A Case Study of Moltbook	developer_tool	31
jina-embeddings-v5-text: Task-Targeted Embedding Distillation	developer_tool	28
Measuring and Eliminating Refusals in Military Large Language Models	cs.AI	0
GPSBench: Do Large Language Models Understand GPS Coordinates?	cs.AI	0
Can Adversarial Code Comments Fool AI Security Reviewers -- Large-Scale Empirical Study of Comment-Based Attacks and Defenses Against LLM Code Analysis	cs.AI	0
Federated Graph AGI for Cross-Border Insider Threat Intelligence in Government Financial Schemes	cs.AI	0
OmniCT: Towards a Unified Slice-Volume LVLM for Comprehensive CT Analysis	cs.AI	0
Surrogate-Based Prevalence Measurement for Large-Scale A/B Testing	cs.AI	0
Rethinking ANN-based Retrieval: Multifaceted Learnable Index for Large-scale Recommendation System	cs.AI	0
DeepVision-103K: A Visually Diverse, Broad-Coverage, and Verifiable Mathematical Dataset for Multimodal Reasoning	cs.AI	0
Retrieval Collapses When AI Pollutes the Web	cs.AI	0
Human-AI Collaboration in Large Language Model-Integrated Building Energy Management Systems: The Role of User Domain Knowledge and AI Literacy	cs.AI	0

🏢 Lab Blog Posts

OpenAI: Introducing OpenAI for India
OpenAI: Introducing EVMbench
DeepMind: A new way to express yourself: Gemini can now create music

AI Watchtower Briefing — 2026-02-18
