AI Watchtower

🔴 High Significance

Model Releases

🔴 SkillClaw: Let Skills Evolve Collectively with Agentic Evolver — score 85 Sources: huggingface

Large language model (LLM) agents such as OpenClaw rely on reusable skills to perform complex tasks, yet these skills remain largely static after deployment. As a result, similar workflows, tool usage patterns, and failure modes are repeatedly rediscovered across users, preventing the system from im

🔴 ClawBench: Can AI Agents Complete Everyday Online Tasks? — score 75 Sources: huggingface

AI agents may be able to automate your inbox, but can they automate other routine aspects of your life? Everyday online tasks offer a realistic yet unsolved testbed for evaluating the next generation of AI agents. To this end, we introduce ClawBench, an evaluation framework of 153 simple tasks that

Infrastructure & Compute

🔴 Rethinking Generalization in Reasoning SFT: A Conditional Analysis on Optimization, Data, and Model Capability — score 95 Sources: huggingface

A prevailing narrative in LLM post-training holds that supervised finetuning (SFT) memorizes while reinforcement learning (RL) generalizes. We revisit this claim for reasoning SFT with long chain-of-thought (CoT) supervision and find that cross-domain generalization is not absent but conditional, jo

🟡 Notable

Model Releases

🟡 HY-Embodied-0.5: Embodied Foundation Models for Real-World Agents — score 65 Sources: huggingface

We introduce HY-Embodied-0.5, a family of foundation models specifically designed for real-world embodied agents. To bridge the gap between general Vision-Language Models (VLMs) and the demands of embodied agents, our models are developed to enhance the core capabilities required by embodied intelli

🟡 ChatGPT for customer success teams — score 50 Sources: lab_blog/OpenAI

Learn how customer success teams use ChatGPT to manage accounts, improve communication, reduce churn, and drive adoption and renewals.

🟡 Using skills — score 50 Sources: lab_blog/OpenAI

Learn how to create and use ChatGPT skills to build reusable workflows, automate recurring tasks, and ensure consistent, high-quality outputs.

🟡 Healthcare — score 50 Sources: lab_blog/OpenAI

Explore how clinicians use ChatGPT to support diagnosis, documentation, and patient care with secure, HIPAA-compliant AI tools.

🟡 Using projects in ChatGPT — score 50 Sources: lab_blog/OpenAI

Learn how to use projects in ChatGPT to organize chats, files, and instructions, manage ongoing work, and collaborate more effectively.

Omitted 19 additional model-release items from the main section; see the raw data and source-specific sections below.

Developer Tools

🟡 When Numbers Speak: Aligning Textual Numerals and Visual Instances in Text-to-Video Diffusion Models — score 55 Sources: huggingface

Text-to-video diffusion models have enabled open-ended video synthesis, but often struggle with generating the correct number of objects specified in a prompt. We introduce NUMINA, a training-free identify-then-guide framework for improved numerical alignment. NUMINA identifies prompt-layout inconsis

🟡 MegaStyle: Constructing Diverse and Scalable Style Dataset via Consistent Text-to-Image Style Mapping — score 45 Sources: huggingface

In this paper, we introduce MegaStyle, a novel and scalable data curation pipeline that constructs an intra-style consistent, inter-style diverse and high-quality style dataset. We achieve this by leveraging the consistent text-to-image style mapping capability of current large generative models, wh

Other Signals

🟡 Our response to the Axios developer tool compromise — score 50 Sources: lab_blog/OpenAI

OpenAI responds to the Axios supply chain attack by rotating macOS code signing certificates, updating apps, and confirming no user data was compromised.

🟢 Incremental

Developer Tools

🟢 LPM 1.0: Video-based Character Performance Model — score 35 Sources: huggingface

Performance, the externalization of intent, emotion, and personality through visual, vocal, and temporal behavior, is what makes a character alive. Learning such performance from video is a promising alternative to traditional 3D pipelines. However, existing video models struggle to jointly achieve

🟢 DMax: Aggressive Parallel Decoding for dLLMs — score 25 Sources: huggingface

We present DMax, a new paradigm for efficient diffusion language models (dLLMs). It mitigates error accumulation in parallel decoding, enabling aggressive decoding parallelism while preserving generation quality. Unlike conventional masked dLLMs that decode through a binary mask-to-token transition,

🟢 Externalization in LLM Agents: A Unified Review of Memory, Skills, Protocols and Harness Engineering — score 15 Sources: huggingface

Large language model (LLM) agents are increasingly built less by changing model weights than by reorganizing the runtime around them. Capabilities that earlier systems expected the model to recover internally are now externalized into memory stores, reusable skills, interaction protocols, and the su

🟢 OpenVLThinkerV2: A Generalist Multimodal Reasoning Model for Multi-domain Visual Tasks — score 5 Sources: huggingface

Group Relative Policy Optimization (GRPO) has emerged as the de facto Reinforcement Learning (RL) objective driving recent advancements in Multimodal Large Language Models. However, extending this success to open-source multimodal generalist models remains heavily constrained by two primary challeng

📄 New Papers

Title	Category	Score	Link
Rethinking Generalization in Reasoning SFT: A Conditional Analysis on Optimization, Data, and Model Capability	infrastructure	330	Open
SkillClaw: Let Skills Evolve Collectively with Agentic Evolver	model_release	293	Open
ClawBench: Can AI Agents Complete Everyday Online Tasks?	model_release	266	Open
HY-Embodied-0.5: Embodied Foundation Models for Real-World Agents	model_release	192	Open
When Numbers Speak: Aligning Textual Numerals and Visual Instances in Text-to-Video Diffusion Models	developer_tool	119	Open
Dictionary-Aligned Concept Control for Safeguarding Multimodal LLMs	cs.AI	0	Open
MPAC: A Multi-Principal Agent Coordination Protocol for Interoperable Multi-Agent Collaboration	cs.AI	0	Open
Scalable High-Recall Constraint-Satisfaction-Based Information Retrieval for Clinical Trials Matching	cs.AI	0	Open
ECM Contracts: Contract-Aware, Versioned, and Governable Capability Interfaces for Embodied Agents	cs.AI	0	Open
Hidden in Plain Sight: Visual-to-Symbolic Analytical Solution Inference from Field Visualizations	cs.AI	0	Open
SPPO: Sequence-Level PPO for Long-Horizon Reasoning Tasks	cs.AI	0	Open
AI-Induced Human Responsibility (AIHR) in AI-Human teams	cs.AI	0	Open
AudioGuard: Toward Comprehensive Audio Safety Protection Across Diverse Threat Models	cs.AI	0	Open
MedFormer-UR: Uncertainty-Routed Transformer for Medical Image Classification	cs.AI	0	Open
Semantic Channel Theory: Deductive Compression and Structural Fidelity for Multi-Agent Communication	cs.AI	0	Open

🏢 Lab Blog Posts

OpenAI: ChatGPT for customer success teams
OpenAI: Using skills
OpenAI: Healthcare
OpenAI: Our response to the Axios developer tool compromise
OpenAI: Using projects in ChatGPT
OpenAI: ChatGPT for research
OpenAI: Personalizing ChatGPT
OpenAI: Brainstorming with ChatGPT
OpenAI: Writing with ChatGPT
OpenAI: Responsible and safe use of AI
OpenAI: Research with ChatGPT
OpenAI: Getting started with ChatGPT
OpenAI: Creating images with ChatGPT
OpenAI: ChatGPT for operations teams
OpenAI: Analyzing data with ChatGPT
OpenAI: ChatGPT for managers
OpenAI: Financial services
OpenAI: AI fundamentals
OpenAI: Applications of AI at OpenAI
OpenAI: Prompting fundamentals
OpenAI: ChatGPT for sales teams
OpenAI: ChatGPT for finance teams
OpenAI: Working with files in ChatGPT
OpenAI: Using custom GPTs

AI Watchtower Briefing — 2026-04-10
