Weekly Narrative
This week’s AI stack moved less like a single model-release cycle and more like a convergence around agent infrastructure: orchestration, skill systems, evidence trails, local code context, and inference economics.
On the enterprise side, Mistral introduced a public preview of Workflows, positioning it as an orchestration layer for running AI reliably in production. That matters because the product pitch is no longer “we have a capable model,” but “we can wire capable models into repeatable operational flows.” Anthropic’s acquisition of Stainless points in the same direction from the developer platform side: Stainless is an SDK and MCP server platform that has powered every Anthropic SDK since the early days of the API, so bringing it in-house tightens the loop between model APIs, generated SDKs, and agent-facing integration surfaces. xAI also pushed outward through agent integrations, letting Grok and X Premium subscriptions work inside Hermes Agent and OpenClaw, including X post search, image/video generation, and chat. Grok Build also entered early beta, for SuperGrok Heavy subscribers, as an agentic CLI for coding, building apps, and automating workflows.
The open-source agent ecosystem was unusually dense. Anthropic’s public skills repo and claude-plugins-official formalized Agent Skills as reusable, inspectable capability bundles. Nearby, tech-leads-club/agent-skills, Imbad0202/academic-research-skills, and K-Dense-AI/scientific-agent-skills show the community converging on skill registries for coding, research, science, finance, and writing workflows. The research paper “SkillsVote” adds a governance layer to this trend, treating agent skills as lifecycle-managed artifacts collected, recommended, and evolved from long-horizon traces rather than loose prompt snippets.
Code-agent infrastructure also kept fragmenting into specialized tools. colbymchenry/codegraph offers a local pre-indexed code knowledge graph for Claude Code, Codex, Cursor, and OpenCode, aiming to cut tool calls and token use. Lum1104/Understand-Anything takes a similar graph-first route for interactive code understanding. rtk-ai/rtk attacks the same cost problem from the shell, proxying common dev commands to reduce LLM token consumption by a claimed 60-90%. rohitg00/agentmemory focuses on persistent memory for coding agents, while git-ai-project/git-ai tracks AI-generated code in repos. The recurring idea is that agents need durable context, provenance, and cheaper interfaces to existing developer systems, not just larger context windows.
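To make the "cheaper interfaces" idea concrete, here is a minimal sketch of one way a shell-side proxy could shrink command output before an agent reads it. This is an illustration only and not how rtk is implemented: `compact_output`, `word_count`, and the sample build log are hypothetical, and whitespace-separated words are used as a crude stand-in for tokens.

```python
from collections import Counter


def compact_output(lines, max_lines=5):
    """Collapse repeated lines and cap how many distinct lines are kept."""
    counts = Counter(lines)  # Counter keeps first-seen order (Python 3.7+)
    compact = [
        f"{line} (x{count})" if count > 1 else line
        for line, count in counts.items()
    ]
    if len(compact) > max_lines:
        hidden = len(compact) - max_lines
        compact = compact[:max_lines] + [f"... {hidden} more distinct lines"]
    return compact


def word_count(lines):
    """Very rough stand-in for a token count: whitespace-separated words."""
    return sum(len(line.split()) for line in lines)


# Hypothetical build log with a heavily repeated warning.
raw = ["warning: unused variable"] * 8 + ["error: missing import", "build failed"]
short = compact_output(raw)
saved = 1 - word_count(short) / word_count(raw)

print(short)
print(f"words: {word_count(raw)} -> {word_count(short)} ({saved:.0%} fewer)")
```

The savings in a sketch like this depend entirely on how repetitive the output is, which is one reason a headline range such as 60-90% is best read as workload-dependent.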
The paper stream reinforced that point. “RoadmapBench” evaluates long-horizon agentic software development across version upgrades, while “PBT-Bench” tests agents on property-based testing and “CHI-Bench” asks whether agents can automate policy-heavy healthcare workflows. “Argus” frames deep research as evidence assembly, and “Hallucination as Exploit” pushes toward evidence-carrying multimodal agents. “Fine-grained Claim-level RAG Benchmark for Law” and “CiteVQA” both tighten evaluation around attribution: not just whether a system answers, but whether it can point to the right supporting claim, document region, or legal evidence.
Long-context and attention work stayed active. “Long Context Pre-Training with Lighthouse Attention” proposes a training-only symmetrical selection mechanism to reduce the SDPA bottleneck at extreme sequence lengths. “Full Attention Strikes Back” claims sparse attention can absorb full-attention behavior within roughly a hundred training steps, while “Exact Linear Attention” continues the search for lower-cost attention without giving up exactness. “CODA” attacks transformer efficiency lower in the stack, rewriting blocks as GEMM-epilogue programs. Meanwhile “The Silent Hyperparameter” calls out inference backends as a reproducibility variable, which fits the community’s growing obsession with serving details, hardware comparisons, and local deployment quirks.
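One commonly cited reason a serving backend can act as a hidden variable is that floating-point addition is not associative, so kernels that accumulate the same numbers in a different order can produce slightly different results. The sketch below shows only that general effect in plain Python; it is not taken from “The Silent Hyperparameter” and makes no claim about which mechanisms that paper measures.

```python
import math

values = [0.1, 0.2, 0.3]

# Same three numbers, two different accumulation orders.
left_to_right = (values[0] + values[1]) + values[2]
right_to_left = values[0] + (values[1] + values[2])

print(left_to_right == right_to_left)              # the two sums are not bit-identical
print(math.isclose(left_to_right, right_to_left))  # but they agree within tolerance
print(abs(left_to_right - right_to_left))          # size of the discrepancy
```

A difference this small is usually harmless on its own, but in a long generation it could in principle flip a near-tie between two candidate tokens, after which the outputs diverge.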
Model chatter was strongest around open weights. LocalLLaMA tracked excitement over Qwen 3.7, including expectations for new 27B and 122B variants, while another thread noted MTP approval for llama.cpp. NuExtract3, an Apache-2.0 4B VLM based on Qwen3.5-4B, targeted OCR, Markdown, and structured extraction. ByteDance’s Lance drew attention as a 3B unified multimodal model for image and video understanding, generation, and editing. On the infrastructure side, discussions compared M5 Macs, DGX Spark, Strix Halo, and RTX 6000 setups, while ai-dynamo/dynamo appeared as a datacenter-scale distributed inference serving framework.
Multimodal research widened in both capability and skepticism. “Video2GUI” synthesizes large-scale GUI interaction trajectories for generalized GUI-agent pretraining. “When Vision Speaks for Sound” argues that video-capable MLLMs often infer audio from visual cues rather than actually verifying sound. “DeltaPrompts,” “Vision-OPD,” “Fill the GAP,” and “FineBench” all probe finer-grained visual reasoning, distillation, and human activity understanding. On the generative side, HKUDS’s ViMax describes an all-in-one agentic video generation stack, HeyGen’s hyperframes renders video from HTML for agents, and NVLabs’ Sana continued to represent efficient high-resolution image synthesis.
The week’s community layer was unusually governance-heavy. Anthropic published its US-China AI competition paper, expanded philanthropic commitments with the Gates Foundation, released an audiobook version of Claude’s Constitution, and convened dialogues with scholars and ethicists. Karpathy announced that he has joined Anthropic, while also continuing to argue that LLMs are more than accelerators for existing workflows. The arXiv debate over banning authors of papers with hallucinated references showed the research community trying to set enforcement norms for AI-assisted publication. And a prompt-injection story, where a “harmless” prompt leaked internal architecture details, was a reminder that agent security is now a systems problem, echoing this week’s paper “Agent Security is a Systems Problem” almost too neatly.
Recurring Titles
- @MistralAI: 🆕 Today, we're releasing the public preview of Workflows, the orchestration layer for enterprise AI. 🌎 Enterprise teams have capable models. What they don't have is a way to run them reliably in prod — 7 days
- @AnthropicAI: We've published a paper that explains our views on AI competition between the US and China. The US and democratic allies hold the lead in frontier AI today. Read more on what it’ll take to keep that — 7 days
- @xai: You can now use X Premium subscriptions in Hermes Agent, and Hermes Agent can now search X posts. https://x.ai/news/grok-hermes — 7 days
- @MistralAI: Mistral AI made the TIME100 Most Influential Companies list for 2026 — and the top 10 for AI. Why we're proud: customers run frontier models in production on their own terms, on their own infrastruct — 7 days
- @karpathy: This works really well btw, at the end of your query ask your LLM to "structure your response as HTML", then view the generated file in your browser. I've also had some success asking the LLM to prese — 7 days
- @karpathy: This is the quote I've been citing a lot recently. — 7 days
- @karpathy: Fireside chat at Sequoia Ascent 2026 from a ~week ago. Some highlights: The first theme I tried to push on is that LLMs are about a lot more than just speeding up what existed before (e.g. coding). T — 7 days
- @ilyasut: It’s extremely good that Anthropic has not backed down, and it’s significant that OpenAI has taken a similar stance. In the future, there will be much more challenging situations of this nature, and — 7 days
- @ilyasut: One point I made that didn’t come across: - Scaling the current thing will keep leading to improvements. In particular, it won’t stall. - But something important will continue to be missing. — 7 days
- @ilyasut: Important work — 7 days
- @ilyasut: truly the greatest day ever🎗️ — 7 days
- @ilyasut: a revolutionary breakthrough if i've ever seen one — 7 days
- tinyhumansai/openhuman — Your Personal AI super intelligence. Private, Simple and extremely powerful. — 6 days
- HKUDS/CLI-Anything — "CLI-Anything: Making ALL Software Agent-Native" -- CLI-Hub:https://clianything.cc/ — 6 days
- @AnthropicAI: We’re partnering with the Gates Foundation, committing $200 million in grants, Claude credits, and technical support to programs in global health, life sciences, education, agriculture, and economic m — 6 days
- @AnthropicAI: Claude's Constitution is now an audiobook, read by two of its authors, Amanda Askell and Joe Carlsmith. It includes a Q&A on the writing process, the philosophies that shaped the document, and how it — 6 days
- ruvnet/RuView — π RuView turns commodity WiFi signals into real-time spatial intelligence, vital sign monitoring, and presence detection — all without a single pixel of video. — 5 days
- BigBodyCobain/Shadowbroker — Open-source intelligence for the global theater. Track everything from the corporate/private jets of the wealthy, and spy satellites, to seismic events in one unified interface. Hook an AI agent up to have it parse through data and find previously unseen correlations. The knowledge is available to all but rarely aggregated in the open, until now. — 5 days
- @xai: You can now use your @grok subscription inside @NousResearch Hermes Agent. http://x.ai/news/grok-hermes — 5 days
- colbymchenry/codegraph — Pre-indexed code knowledge graph for Claude Code, Codex, Cursor, and OpenCode — fewer tokens, fewer tool calls, 100% local — 5 days
- HKUDS/ViMax — "ViMax: Agentic Video Generation (Director, Screenwriter, Producer, and Video Generator All-in-One)" — 5 days
- Argus: Evidence Assembly for Scalable Deep Research Agents — 5 days
- Imbad0202/academic-research-skills — Academic Research Skills for Claude Code: research → write → review → revise → finalize — 5 days
- @AnthropicAI: Anthropic is acquiring @stainlessapi, an SDK and MCP server platform that has powered every Anthropic SDK since the earliest days of our API. Read more: https://www.anthropic.com/news/anthropic-acqui — 5 days
- @mattshumer_: With a few prompt tweaks/strategies, and a switch to Codex 5.5 instead of Opus 4.7, you can get MUCH closer to the SUPERHOT design style. This was a one-shot output! — 5 days
- @mattshumer_: I firmly believe that even the most optimistic people in AI are severely underestimating how big the market for inference is going to be. — 5 days
- K-Dense-AI/scientific-agent-skills — A set of ready to use Agent Skills for research, science, engineering, analysis, finance and writing. — 4 days
- rtk-ai/rtk — CLI proxy that reduces LLM token consumption by 60-90% on common dev commands. Single Rust binary, zero dependencies — 4 days
- KeygraphHQ/shannon — Shannon Lite is an autonomous, white-box AI pentester for web applications and APIs. It analyzes your source code, identifies attack vectors, and executes real exploits to prove vulnerabilities before they reach production. — 4 days
- @simonw: To prepare for my #PyConUS lightning talk this afternoon I decided to track down ALL of the names that @openclaw has used since November, using a script against its GitHub repo Warelay → CLAWDIS → CL — 4 days
- dani-garcia/vaultwarden — Unofficial Bitwarden compatible server written in Rust, formerly known as bitwarden_rs — 4 days
- tech-leads-club/agent-skills — The secure, validated skill registry for professional AI coding agents. Extend Antigravity, Claude Code, Cursor, Copilot and more with absolute confidence. — 4 days
- NVlabs/Sana — SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformer — 4 days
- ALSO: Adversarial Online Strategy Optimization for Social Agents — 4 days
- Learning Bilevel Policies over Symbolic World Models for Long-Horizon Planning — 4 days
- HoloMotion-1 Technical Report — 4 days
- DeltaPrompts: Escaping the Zero-Delta Trap in Multimodal Distillation — 4 days
- Pretraining Objective Matters in Extreme Low-Data FGVC: A Backbone-Controlled Study — 4 days
- Bridging Silicon and the Hippocampus: Algebro-Deterministic Memory "VaCoAl" as a Substrate for Vector-HaSH and TEM — 4 days
- RoadmapBench: Evaluating Long-Horizon Agentic Software Development Across Version Upgrades — 4 days
- Towards Generalization of Block Attention via Automatic Segmentation and Block Distillation — 4 days
- Network-Aware Bilinear Tokenization for Brain Functional Connectivity Representation Learning — 4 days
- π-Bench: Evaluating Proactive Personal Assistant Agents in Long-Horizon Workflows — 4 days
- Fill the GAP: A Granular Alignment Paradigm for Visual Reasoning in Multimodal Large Language Models — 4 days
- Beyond GRPO and On-Policy Distillation: An Empirical Sparse-to-Dense Reward Principle for Language-Model Post-Training — 4 days
- Protocol-Driven Development: Governing Generated Software Through Invariants and Continuous Evidence — 4 days
- rmyndharis/OpenWA — Free, Open Source, Self-Hosted WhatsApp API Gateway — 4 days
- heygen-com/hyperframes — Write HTML. Render video. Built for agents. — 4 days
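- ZhuLinsen/daily_stock_analysis — LLM-powered stock analysis system for A-share, H-share, and US markets: multi-source market data + real-time news + LLM decision dashboard + multi-channel push notifications, with zero-cost scheduled runs, completely free. — 4 days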
- General Preference Reinforcement Learning — 4 days
- S2Aligner: Pair-Efficient and Transferable Pre-Training for Sparse Text-Attributed Graphs — 4 days
- anthropics/claude-plugins-official — Official, Anthropic-managed directory of high quality Claude Code Plugins. — 4 days
- Benchmarking Commercial ASR Systems on Code-Switching Speech: Arabic, Persian, and German — 4 days
- D³-Subsidy: Online and Sequential Driver Subsidy Decision-Making for Large-Scale Ride-Hailing Market — 4 days
- @AnthropicAI: Over the past few months, we've been holding dialogues with scholars, philosophers, clergy, and ethicists on the questions AI raises—starting with how good character forms. Read more about how we’re — 4 days
- @xai: Starting today, use your Grok or X Premium subscription in @openclaw. Chat with your agent, generate images and videos, or search for X posts. http://x.ai/news/grok-openclaw — 4 days
- @karpathy: Personal update: I've joined Anthropic. I think the next few years at the frontier of LLMs will be especially formative. I am very excited to join the team here and get back to R&D. I remain deeply pa — 4 days
- @mattshumer_: GET ON THIS GOOGLE!!!!! — 4 days
- vercel-labs/agent-browser — Browser automation CLI for AI agents — 4 days
- anthropics/skills — Public repository for Agent Skills — 3 days
- oven-sh/bun — Incredibly fast JavaScript runtime, bundler, test runner, and package manager – all in one — 3 days
- dograh-hq/dograh — Open Source Voice Agent Platform — 3 days
- @xai: An early beta of Grok Build, an agentic CLI for coding, building apps, and automating workflows is now available for SuperGrok Heavy subscribers. Through this early beta, we will improve the model an — 3 days
- facebook/pyrefly — A fast type checker and language server for Python — 3 days
- @OpenAI: Another reason to switch to Codex. — 3 days
- @simonw: First talk in our AI track at #PyConUS is coming up at 11am — 3 days
- @mattshumer_: Just wiped the Mac Mini I set up for OpenClaw. I’m turning it into an always-on devbox to use with Codex mobile. Have a feeling this is gonna be amazing. — 3 days
- @sama: i appreciate how seriously the team always takes these reports (even when the answer turns out to be 'i got used to the current level of magic and now i'd like more please') — 3 days
- @sama: also all this: — 3 days
- calcom/cal.diy — Scheduling infrastructure for absolutely everyone. — 3 days
- PBT-Bench: Benchmarking AI Agents on Property-Based Testing — 3 days
- IVGT: Implicit Visual Geometry Transformer for Neural Scene Representation — 3 days
- Calibrating LLMs with Semantic-level Reward — 3 days
- Lagrangian Flow Matching: A Least-Action Framework for Principled Path Design — 3 days
- SEED: Targeted Data Selection by Weighted Independent Set — 3 days
- FocalPolicy: Frequency-Optimized Chunking and Locally Anchored Flow Matching for Coherent Visuomotor Policy — 3 days
- Adaptive Regularization for Sparsity Control in Bregman-Based Optimizers — 3 days
- @simonw: Also a great example of positive contribution to open source by wanderingmeow - you don't need to contribute code to have a positive impact, just providing detailed feedback and confirmation that some — 3 days
- iii-hq/iii — Effortlessly compose, extend, and observe every service in real-time for the first time ever. — 3 days
- rohitg00/agentmemory — #1 Persistent memory for AI coding agents based on real-world benchmarks — 3 days
- humanlayer/12-factor-agents — What are the principles we can use to build LLM-powered software that is actually good enough to put in the hands of production customers? — 3 days
- CHI-Bench: Can AI Agents Automate End-to-End, Long-Horizon, Policy-Rich Healthcare Workflows? — 3 days
- Symphony for Speech-to-Text: Supporting Real-Time Medical Voice Interfaces — 3 days
- Recall Isn't Enough: Bounding Commitments in Personalized Language Systems — 3 days
- 1GC-7RC: One Graphic Card -- Seven Research Challenges! How Good Are AI Agents at Doing Your Job? — 3 days
- UCSF-PDGM-VQA: Visual Question Answering dataset for brain tumor MRI interpretation — 3 days
- Prompt2Fingerprint: Plug-and-Play LLM Fingerprinting via Text-to-Weight Generation — 3 days
- Vision-OPD: Learning to See Fine Details for Multimodal LLMs via On-Policy Self-Distillation — 3 days
- WriteSAE: Sparse Autoencoders for Recurrent State — 3 days
- Membership Inference Attacks on Discrete Diffusion Language Models — 3 days
- Nested Spatio-Temporal Time Series Forecasting — 3 days
- EfficientTDMPC: Improved MPC Objectives for Sample-Efficient Continuous Control — 3 days
- PULSE: Generative Phase Evolution for Non-Stationary Time Series Forecasting — 3 days
- Jacobian-Guided Anisotropic Noise Reshaping for Enhancing Representation Utility under Local Differential Privacy — 3 days
- Learning-Zone Energy: Online Data Selection for Efficient RL Post-Training — 3 days
- Structured Neural Marked Point Processes for Interpretable Event Interaction Modeling — 3 days
- Balancing Knowledge Distillation for Imbalance Learning with Bilevel Optimization — 3 days
- A Simplex Witness Certificate for Constant Collapse in Variational Autoencoders — 3 days
- Stochastic Penalty-Barrier Methods for Constrained Machine Learning — 3 days
- Voice "Cloning" is Style Transfer — 3 days
- Charon: A Unified and Fine-Grained Simulator for Large-Scale LLM Training and Inference — 3 days
- SVFSearch: A Multimodal Knowledge-Intensive Benchmark for Short-Video Frame Search in the Gaming Vertical Domain — 3 days
- An Approximation Algorithm for Graph Label Selection — 3 days
- Needles in the Landscape: Semi-Supervised Pseudolabeling for Archaeological Site Discovery under Label Scarcity — 3 days
- Constrained Policy Optimization via Sampling-Based Weight-Space Projection — 3 days
- DSPR: Dual-Stream Physics-Residual Networks for Trustworthy Industrial Time Series Forecasting — 3 days
- Variational Optimality of Föllmer Processes in Generative Diffusions — 3 days
- One-Block Transformer (1BT) for EEG-Based Cognitive Workload Assessment — 3 days
- Predicting 3D structure by latent posterior sampling — 3 days
- ast-grep/ast-grep — ⚡A CLI tool for code structural search, lint and rewriting. Written in Rust — 3 days
- nautechsystems/nautilus_trader — Production-grade Rust-native trading engine with deterministic event-driven architecture — 3 days
- Alishahryar1/free-claude-code — Use claude-code for free in the terminal, VSCode extension or discord like OpenClaw (voice supported) — 3 days
- Hallucination as Exploit: Evidence-Carrying Multimodal Agents — 3 days
- Generative Recursive Reasoning — 3 days
- Exact Linear Attention — 3 days
- ZeroUnlearn: Few-Shot Knowledge Unlearning in Large Language Models — 3 days
- Agent Security is a Systems Problem — 3 days
- COBALT: Crowdsourcing Robot Learning via Cloud-Based Teleoperation with Smartphones — 3 days
- RE-VLM: Event-Augmented Vision-Language Model for Scene Understanding — 3 days
- Toward User Comprehension Supports for LLM Agent Skill Specifications — 3 days
- Rebalancing Reference Frame Dominance to Improve Motion in Image-to-Video Models — 3 days
- Unlocking the Potential of Continual Model Merging: An ODE Perspective — 3 days
- ARC-RL: A Reinforcement Learning Playground Inspired by ARC Raiders — 3 days
- Lens Privacy Sealing: A New Benchmark and Method for Physical Privacy-Preserving Action Recognition — 3 days
- LIFT and PLACE: A Simple, Stable, and Effective Knowledge Distillation Framework for Lightweight Diffusion Models — 3 days
- FineBench: Benchmarking and Enhancing Vision-Language Models for Fine-grained Human Activity Understanding — 3 days
- PromptRad: Knowledge-Enhanced Multi-Label Prompt-Tuning for Low-Resource Radiology Report Labeling — 3 days
- When to Re-Commit: Temporal Abstraction Discovery for Long-Horizon Vision-Language Reasoning — 3 days
- IMPACT: Influence Modeling for Open-Set Time Series Anomaly Detection — 3 days
- CODA: Rewriting Transformer Blocks as GEMM-Epilogue Programs — 3 days
- The Silent Hyperparameter: Quantifying the Impact of Inference Backends on LLM Reproducibility — 3 days
- Do Better Volatility Forecasts Lead to Better Portfolios? Evidence from Graph Neural Networks — 3 days
- git-ai-project/git-ai — A Git extension for tracking the AI-generated code in your repos — 3 days
- rohitg00/ai-engineering-from-scratch — Learn it. Build it. Ship it for others. — 3 days
- can1357/oh-my-pi — ⌥ AI Coding agent for the terminal — hash-anchored edits, optimized tool harness, LSP, Python, browser, subagents, and more — 3 days
- openai/codex — Lightweight coding agent that runs in your terminal — 3 days
- Lum1104/Understand-Anything — Graphs that teach > graphs that impress. Turn any code into an interactive knowledge graph you can explore, search, and ask questions about. Works with Claude Code, Codex, Cursor, Copilot, Gemini CLI, and more. — 3 days
- Fine-grained Claim-level RAG Benchmark for Law — 3 days
- GROW: Aligning GRPO with State-Action Modeling for Open-World VLM Agents — 3 days
- Neural Collapse by Design: Learning Class Prototypes on the Hypersphere — 3 days
- AirfoilGen: A valid-by-construction and performance-aware latent diffusion model for airfoil generation — 3 days
- Decision-Path Patterns as Tree Reliability Signals: Path-based Adaptive Weighting for Random Forest Classification — 3 days
- Choose Wisely and Privately: Proactive Client Selection for Fair and Efficient Federated Learning — 3 days
- Behavior-Consistent Deep Reinforcement Learning — 3 days
- CoarseSoundNet: Building a reliable model for ecological soundscape analysis — 3 days
- Insights Generator: Systematic Corpus-Level Trace Diagnostics for LLM Agents — 3 days
- TIP: Token Importance in On-Policy Distillation — 3 days
- BALLAST: Bayesian Active Learning with Look-ahead Amendment for Sea-drifter Trajectories under Spatio-Temporal Vector Fields — 3 days
- @sama: three of the things we are most excited about: 1. AGI accelerating research 2. AGI accelerating companies 3. personal AGI accelerating everyone in achieving their goals today it was great to announc — 3 days
- agentgateway/agentgateway — Next Generation Agentic Proxy for AI Agents and MCP servers — 3 days
- ai-dynamo/dynamo — A Datacenter Scale Distributed Inference Serving Framework — 3 days