Weekly Narrative
This week’s AI stack continued to move from “model demo” toward deployable agent systems, with the strongest signals clustering around orchestration, memory, tool use, benchmarks, and local inference constraints.
On the product side, Mistral previewed Workflows, framing it as an orchestration layer for enterprise AI: not another model release, but infrastructure for running capable models reliably in production. That sits neatly beside Anthropic’s new public-facing repositories: anthropics/financial-services, which suggests domain packaging for regulated workflows, and anthropics/skills, a public repository for reusable agent skills. The direction is clear: vendors are trying to turn model capability into composable operating procedures. xAI pushed in a similar applied direction with Grok Voice Think Fast 1.0 for customer-support voice agents and an API Image Generation Quality Mode, emphasizing robustness in noisy environments and higher-fidelity media generation rather than leaderboard abstraction.
Open-source activity is converging on agent workbenches. NousResearch/hermes-agent recurred across the week as “the agent that grows with you,” while rowboatlabs/rowboat describes itself as an “AI coworker, with memory,” and rohitg00/agentmemory bills itself as the top persistent memory for AI coding agents, citing real-world benchmarks. memvid/memvid takes a more infrastructure-shaped angle: a serverless, single-file memory layer meant to replace more complex RAG pipelines. K-Dense-AI/scientific-agent-skills extends the same idea into ready-to-use skills for research, engineering, finance, analysis, and writing. The agent surface is also fragmenting into manager shells and switching tools: cc-switch, Codex-Manager, kiro-account-manager, and kiro.rs all point to a world where developers juggle multiple coding agents, accounts, quotas, and local gateways.
Coding-agent tooling remained especially active. github/spec-kit pushed spec-driven development as a first-class workflow, while garrytan/gstack packaged an opinionated Claude Code setup with roles such as CEO, designer, engineering manager, release manager, documentation engineer, and QA. millionco/react-doctor targets a narrower failure mode: catching bad React emitted by agents. OpenAI’s “Codex is now in the ChatGPT mobile app” signal matters less as a UI feature than as a workflow change: coding agents are becoming persistent remote tasks that can be monitored, steered, and approved away from the terminal.
The research literature sharpened the same questions. Workspace-Bench 1.0 evaluates agents on workspace tasks with large-scale file dependencies, which is exactly where many current coding agents are brittle. SREGym introduces live, high-fidelity failure scenarios for SRE agents. Agentick proposes a unified benchmark for general sequential decision-making agents, while FORTIS benchmarks over-privilege in agent skills and MCPShield targets content-aware attack detection in agent tool-call traffic. Security work also showed up in DTap, a controllable interactive red-teaming platform for agents, and Metis, which studies self-evolving jailbreak policies. The common technical theme is that tool-using agents need evaluation at the level of trajectories, privileges, and operational failure, not only final answers.
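To make "evaluation at the level of trajectories and privileges" concrete, here is a minimal sketch of an over-privilege check. It is not the method of FORTIS, MCPShield, or any other benchmark named above. The `ToolCall` type, the permission strings, and the `audit_privileges` helper are all hypothetical names chosen for illustration. The idea is simply to compare the permissions a skill was granted against the permissions its recorded tool calls actually exercised.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    """One step in a recorded agent trajectory (hypothetical schema)."""
    tool: str
    permission: str  # the permission this call exercised, e.g. "fs.read"


def audit_privileges(granted: set[str], trajectory: list[ToolCall]) -> dict[str, set[str]]:
    """Compare granted permissions against those used in a trajectory.

    Returns two sets:
      - "unused_grants": permissions granted but never exercised
        (a simple signal of over-privilege)
      - "ungranted_uses": permissions exercised but never granted
        (a simple signal of a privilege-boundary violation)
    """
    used = {call.permission for call in trajectory}
    return {
        "unused_grants": granted - used,
        "ungranted_uses": used - granted,
    }


if __name__ == "__main__":
    # Illustrative data only; not taken from any real skill manifest.
    granted = {"fs.read", "fs.write", "net.fetch"}
    trajectory = [
        ToolCall(tool="read_file", permission="fs.read"),
        ToolCall(tool="run_command", permission="shell.exec"),
    ]
    report = audit_privileges(granted, trajectory)
    print(sorted(report["unused_grants"]))
    print(sorted(report["ungranted_uses"]))
```

A final-answer check would pass or fail this run without ever seeing that `fs.write` and `net.fetch` went unused or that `shell.exec` was exercised without a grant, which is why the audit operates on the whole trajectory.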
Several papers probed reasoning and planning internals. “Extracting Search Trees from LLM Reasoning Traces Reveals Myopic Planning” directly questions whether reasoning traces correspond to useful long-horizon search. TMAS explores scaling test-time compute via multi-agent synergy, while Predictive Maps of Multi-Agent Reasoning models communication topologies through successor representations. Asymmetric On-Policy Distillation, Flow-OPD, TMPO, TokenRatio, GEAR, and “Beyond GRPO and On-Policy Distillation” all sit in the post-training optimization lane: improving policy behavior through token-level, trajectory-level, sparse-to-dense, or distillation-based objectives. The field is still searching for cleaner ways to spend inference and training compute on reasoning without simply inflating traces.
Long-context and efficient inference also had a busy week. Long Context Pre-Training with Lighthouse Attention proposes a training-only selection mechanism to reduce the quadratic bottleneck at extreme sequence lengths. Memory-Efficient Looped Transformer decouples compute from memory in recurrent-style language models. “Pretraining large language models with MXFP4 on Native FP4 Hardware,” together with community discussion of the FP4 QAT details in the full DeepSeek V4 paper, shows quantization moving deeper into training rather than remaining a serving-only trick. On the local side, the community posted concrete configurations: Qwen3.6 35B A3B at 80 tok/s with 128K context on 12GB VRAM using llama.cpp MTP, a 1T-parameter Kimi K2.5 build over Intel Optane persistent memory at roughly 4 tok/s, and even TinyStories-260K running on a stock Game Boy Color. These are not all practical baselines, but they map the edges of the deployment envelope.
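For a rough sense of why long context on a small GPU is notable, the memory taken by a standard attention KV cache can be estimated with back-of-the-envelope arithmetic: two tensors (keys and values) per layer, each of size KV heads × head dimension × context length × bytes per element. The sketch below uses hypothetical placeholder numbers. They are not the architecture of Qwen3.6, Kimi K2.5, or any other model mentioned above, and real runtimes may reduce the footprint through cache quantization, offloading, or non-standard attention layouts.

```python
def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    context_len: int,
    bytes_per_element: int,
) -> int:
    """Estimate KV-cache size for one sequence under standard attention.

    The leading factor of 2 accounts for storing both keys and values.
    """
    return 2 * num_layers * num_kv_heads * head_dim * context_len * bytes_per_element


def to_gib(num_bytes: int) -> float:
    """Convert bytes to GiB (2**30 bytes)."""
    return num_bytes / 2**30


if __name__ == "__main__":
    # Hypothetical configuration, chosen only to illustrate the arithmetic.
    layers, kv_heads, head_dim = 40, 8, 128
    context = 128 * 1024  # 128K tokens

    fp16 = kv_cache_bytes(layers, kv_heads, head_dim, context, bytes_per_element=2)
    int8 = kv_cache_bytes(layers, kv_heads, head_dim, context, bytes_per_element=1)

    print(f"16-bit cache: {to_gib(fp16):.1f} GiB")  # 20.0 GiB
    print(f"8-bit cache:  {to_gib(int8):.1f} GiB")  # 10.0 GiB
```

Under these placeholder numbers a 16-bit cache alone would exceed 12GB, which is the kind of pressure that pushes local setups toward smaller caches, fewer KV heads, or offloading.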
Multimodal and embodied work filled in the rest of the frontier. Google DeepMind highlighted AlphaEvolve, a Gemini-powered coding agent for algorithmic progress, and a partnership with EVE Online as a sandbox for complex agent behavior. Papers such as One Token Per Frame, CoWorld-VLA, D-VLA, MMSkills, SimWorld Studio, and MobileEgo Anywhere all push toward agents that perceive, plan, and act over longer horizons. Meanwhile, CloakBrowser claiming a 30/30 bot-detection pass rate as a Playwright replacement is a reminder that agent infrastructure also collides with the web’s anti-automation defenses.
The week’s technical center of gravity was not a single model. It was the hardening of the agent substrate: memory, orchestration, skills, local serving, benchmarks, privilege boundaries, and post-training methods that try to make long-horizon behavior less accidental.
Recurring Titles
- CloakHQ/CloakBrowser — Stealth Chromium that passes every bot detection test. Drop-in Playwright replacement with source-level fingerprint patches. 30/30 tests passed. — 7 days
- rohitg00/agentmemory — #1 Persistent memory for AI coding agents based on real-world benchmarks — 7 days
- @MistralAI: 🆕 Today, we're releasing the public preview of Workflows, the orchestration layer for enterprise AI. 🌎 Enterprise teams have capable models. What they don't have is a way to run them reliably in prod — 7 days
- @xai: Your customer support needs a voice agent built for the real world. Grok Voice Think Fast 1.0 handles complex workflows with speed and accuracy, even in hard-to-hear environments. From multi-step tro — 7 days
- @MistralAI: Mistral AI made the TIME100 Most Influential Companies list for 2026 — and the top 10 for AI. Why we're proud: customers run frontier models in production on their own terms, on their own infrastruct — 7 days
- @karpathy: This is the quote I've been citing a lot recently. — 7 days
- @karpathy: Fireside chat at Sequoia Ascent 2026 from a ~week ago. Some highlights: The first theme I tried to push on is that LLMs are about a lot more than just speeding up what existed before (e.g. coding). T — 7 days
- @ilyasut: It’s extremely good that Anthropic has not backed down, and it’s significant that OpenAI has taken a similar stance. In the future, there will be much more challenging situations of this nature, and — 7 days
- @ilyasut: One point I made that didn’t come across: - Scaling the current thing will keep leading to improvements. In particular, it won’t stall. - But something important will continue to be missing. — 7 days
- @ilyasut: Important work — 7 days
- @ilyasut: truly the greatest day ever🎗️ — 7 days
- @ilyasut: a revolutionary breakthrough if i've ever seen one — 7 days
- millionco/react-doctor — Your agent writes bad React. This catches it — 6 days
- @xai: Image Generation Quality Mode is now available on the xAI API. This model has already powered the generation of over 300 million images on Grok. It brings higher realism, stronger text rendering, — 6 days
- yikart/AiToEarn — Let's use AI to Earn! — 6 days
- tinyhumansai/openhuman — Your Personal AI super intelligence. Private, Simple and extremely powerful. — 6 days
- @mattshumer_: I underestimated the pace of progress. — 6 days
- github/spec-kit — 💫 Toolkit to help you get started with Spec-Driven Development — 5 days
- @mattshumer_: Claude down? — 5 days
- @mattshumer_: There are no good bagels on the UWS. If someone opens a solid shop, they’re gonna make a killing. — 5 days
- NousResearch/hermes-agent — The agent that grows with you — 5 days
- Extracting Search Trees from LLM Reasoning Traces Reveals Myopic Planning — 5 days
- garrytan/gstack — Use Garry Tan's exact Claude Code setup: 23 opinionated tools that serve as CEO, Designer, Eng Manager, Release Manager, Doc Engineer, and QA — 5 days
- @AnthropicAI: Claude's Constitution is now an audiobook, read by two of its authors, Amanda Askell and Joe Carlsmith. It includes a Q&A on the writing process, the philosophies that shaped the document, and how it — 5 days
- @karpathy: This works really well btw, at the end of your query ask your LLM to "structure your response as HTML", then view the generated file in your browser. I've also had some success asking the LLM to prese — 5 days
- @simonw: Wrote about today's GitLab restructuring / "workforce reduction" announcement, and ended up digging around in version control for both the GitLab and the 37signals public employee handbooks to help il — 5 days
- datawhalechina/hello-agents — 📚 "Building Agents from Scratch": a from-scratch tutorial on agent principles and practice — 4 days
- HKUDS/AI-Trader — "AI-Trader: 100% Fully-Automated Agent-Native Trading" — 4 days
- bytedance/UI-TARS-desktop — The Open-Source Multimodal AI Agent Stack: Connecting Cutting-Edge AI Models and Agent Infra — 4 days
- @mattshumer_: Totally false. Who still thinks this? — 4 days
- farion1231/cc-switch — A cross-platform desktop All-in-One assistant for Claude Code, Codex, OpenCode, OpenClaw, Gemini CLI & Hermes Agent. Only official website: ccswitch.io — 4 days
- THINKSAFE: Self-Generated Safety Alignment for Reasoning Models — 4 days
- BoostAPR: Boosting Automated Program Repair via Execution-Grounded Reinforcement Learning with Dual Reward Models — 4 days
- IndustryBench: Probing the Industrial Knowledge Boundaries of LLMs — 4 days
- GESR: A Genetic Programming-Based Symbolic Regression Method with Gene Editing — 4 days
- Workspace-Bench 1.0: Benchmarking AI Agents on Workspace Tasks with Large-Scale File Dependencies — 4 days
- @simonw: My Mac had less available memory than I expected, turned out the "claude" Claude Code processes on this machine (running in various terminal windows) were consuming ~30GB on their own! The largest on — 4 days
- danielmiessler/Personal_AI_Infrastructure — Agentic AI Infrastructure for magnifying HUMAN capabilities. — 4 days
- A New Technique for AI Explainability using Feature Association Map — 4 days
- anthropics/financial-services — 3 days
- playcanvas/supersplat — 3D Gaussian Splat Editor — 3 days
- rowboatlabs/rowboat — Open-source AI coworker, with memory — 3 days
- @GoogleDeepMind: Algorithms are part of nearly every aspect of life, from the physics of the natural world to planning shipping routes. Our Gemini-powered coding agent AlphaEvolve has been accelerating progress over — 3 days
- @GoogleDeepMind: Pinned: We’re partnering with the developers of @EveOnline to explore the next frontier of AI research in games. EVE's complex, player-driven universe is the perfect safe sandbox to test agents on me — 3 days
- qxcnm/Codex-Manager — An account management and switching tool for Codex CLI. Provides local gateway forwarding for Codex CLI. — 3 days
- HelixDB/helix-db — HelixDB is an open-source graph-vector database built from scratch in Rust. — 3 days
- anthropics/skills — Public repository for Agent Skills — 3 days
- jundot/omlx — LLM inference server with continuous batching & SSD caching for Apple Silicon — managed from the macOS menu bar — 3 days
- Agentick: A Unified Benchmark for General Sequential Decision-Making Agents — 3 days
- SREGym: A Live Benchmark for AI SRE Agents with High-Fidelity Failure Scenarios — 3 days
- Learning to Communicate Locally for Large-Scale Multi-Agent Pathfinding — 3 days
- MathlibPR: Pull Request Merge-Readiness Benchmark for Formal Mathematical Libraries — 3 days
- Does Your Neural Network Extrapolate? Feature Engineering as Identifiability Bias for OOD Generalization — 3 days
- Toward Privileged Foundation Models: LUPI for Accelerated and Improved Learning — 3 days
- One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy — 3 days
- Flow-OPD: On-Policy Distillation for Flow Matching Models — 3 days
- Adapting Vision-Language Models for Neutrino Event Classification in High-Energy Physics — 3 days
- Asymmetric On-Policy Distillation: Bridging Exploitation and Imitation at the Token Level — 3 days
- AI CFD Scientist: Toward Open-Ended Computational Fluid Dynamics Discovery with Physics-Aware AI Agents — 3 days
- CktFormalizer: Autoformalization of Natural Language into Circuit Representations — 3 days
- Beyond Distribution Estimation: Simplex Anchored Structural Inference Towards Universal Semi-Supervised Learning — 3 days
- @simonw: Shopify's River agent system lives in Slack and can only be used in public so that other employees can learn from what you do with it Reminds me of how Midjourney's Discord-only launch helped people — 3 days
- hank9999/kiro.rs — A Kiro Client in Rust — 3 days
- FORTIS: Benchmarking Over-Privilege in Agent Skills — 3 days
- MemQ: Integrating Q-Learning into Self-Evolving Memory Agents over Provenance DAGs — 3 days
- Done, But Not Sure: Disentangling World Completion from Self-Termination in Embodied Agents — 3 days
- PnP-Corrector: A Universal Correction Framework for Coupled Spatiotemporal Forecasting — 3 days
- UxSID: Semantic-Aware User Interests Modeling for Ultra-Long Sequence — 3 days
- Strategic commitments shape collective cybersecurity under AI inequality — 3 days
- SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning — 3 days
- expo: Exploration-prioritized policy optimization via adaptive KL regulation and Gaussian curriculum sampling — 3 days
- When Language Overwrites Vision: Over-Alignment and Geometric Debiasing in Vision-Language Models — 3 days
- Trapping Attacker in Dilemma: Examining Internal Correlations and External Influences of Trigger for Defending GNN Backdoors — 3 days
- AgentForesight: Online Auditing for Early Failure Prediction in Multi-Agent Systems — 3 days
- Geometrically Constrained Stenosis Editing in Coronary Angiography via Entropic Optimal Transport — 3 days
- Evolutionary Ensemble of Agents — 3 days
- FactoryNet: A Large-Scale Dataset toward Industrial Time-Series Foundation Models — 3 days
- Combining Mechanical and Agentic Specification Inference for Move — 3 days
- Metis: Learning to Jailbreak LLMs via Self-Evolving Metacognitive Policy Optimization — 3 days
- CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving — 3 days
- Prompt-Activation Duality: Improving Activation Steering via Attention-Level Interventions — 3 days
- LoKA: Low-precision Kernel Applications for Recommendation Models At Scale — 3 days
- Shields to Guarantee Probabilistic Safety in MDPs — 3 days
- Context Learning for Multi-Agent Discussion — 3 days
- PATRA: Pattern-Aware Alignment and Balanced Reasoning for Time Series Question Answering — 3 days
- Interactive Benchmarks — 3 days
- Where Do Reasoning Models Refuse? — 3 days
- GAMBIT: A Three-Mode Benchmark for Adversarial Robustness in Multi-Agent LLM Collectives — 3 days
- MemPrivacy: Privacy-Preserving Personalized Memory Management for Edge-Cloud Agents — 3 days
- DeepLévy: Learning Heavy-Tailed Uncertainty in Highly Volatile Time Series — 3 days
- @simonw: New TIL: I figured out how to use my LLM CLI tool in a shebang line, which means you can write executable scripts in English, or hook up more complex scripts with a snippet of YAML template — 3 days
- @sama: speaking of things that have gotten over a threshold for me, the combo of the new ChatGPT model, personality, and personalization feels like a new thing — 3 days
- @sama: would you call it a superapp? — 3 days
- hj01857655/kiro-account-manager — 🚀 Smart management of Kiro IDE accounts, one-click switching, quota monitoring | Official site: https://kiro-website-six.vercel.app — 3 days
- zizmorcore/zizmor — Static analysis for GitHub Actions — 3 days
- Selective Off-Policy Reference Tuning with Plan Guidance — 3 days
- Explaining and Breaking the Safety-Helpfulness Ceiling via Preference Dimensional Expansion — 3 days
- TMPO: Trajectory Matching Policy Optimization for Diverse and Efficient Diffusion Alignment — 3 days
- MCPShield: Content-Aware Attack Detection for LLM Agent Tool-Call Traffic — 3 days
- HEPA: A Self-Supervised Horizon-Conditioned Event Predictive Architecture for Time Series — 3 days
- LiBaGS: Lightweight Boundary Gap Synthesis for Targeted Synthetic Data Selection — 3 days
- Gradient-Free Noise Optimization for Reward Alignment in Generative Models — 3 days
- Predictive Maps of Multi-Agent Reasoning: A Successor-Representation Spectrum for LLM Communication Topologies — 3 days
- Overcoming Dynamics-Blindness: Training-Free Pace-and-Path Correction for VLA Models — 3 days
- Hide to See: Reasoning-prefix Masking for Visual-anchored Thinking in VLM Distillation — 3 days
- GEAR: Granularity-Adaptive Advantage Reweighting for LLM Agents via Self-Distillation — 3 days
- Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation — 3 days
- TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching — 3 days
- Detecting overfitting in Neural Networks during long-horizon grokking using Random Matrix Theory — 3 days
- Beyond GRPO and On-Policy Distillation: An Empirical Sparse-to-Dense Reward Principle for Language-Model Post-Training — 3 days
- Learning, Fast and Slow: Towards LLMs That Adapt Continually — 3 days
- Breaking Down and Building Up: Mixture of Skill-Based Vision-and-Language Navigation Agents — 3 days
- CoFlow: Coordinated Few-Step Flow for Offline Multi-Agent Decision Making — 3 days
- MobileEgo Anywhere: Open Infrastructure for long horizon egocentric data on commodity hardware — 3 days
- memvid/memvid — Memory layer for AI Agents. Replace complex RAG pipelines with a serverless, single-file memory layer. Give your agents instant retrieval and long-term memory. — 3 days
- juspay/hyperswitch — An open source payments switch written in Rust to make payments fast, reliable and affordable — 3 days
- base/base — All components used to run Base — 3 days
- CodebuffAI/codebuff — Generate code from the terminal! — 3 days
- K-Dense-AI/scientific-agent-skills — A set of ready to use Agent Skills for research, science, engineering, analysis, finance and writing. — 3 days
- ArthurBrussee/brush — 3D Reconstruction for all — 3 days
- Moltbook Moderation: Uncovering Hidden Intent Through Multi-Turn Dialogue — 3 days
- D-VLA: A High-Concurrency Distributed Asynchronous Reinforcement Learning Framework for Vision-Language-Action Models — 3 days
- MMSkills: Towards Multimodal Skills for General Visual Agents — 3 days
- WriteSAE: Sparse Autoencoders for Recurrent State — 3 days
- Context Training with Active Information Seeking — 3 days
- Scaling few-shot spoken word classification with generative meta-continual learning — 3 days
- Does language matter for spoken word classification? A multilingual generative meta-learning approach — 3 days
- Watermarking Should Be Treated as a Monitoring Primitive — 3 days
- MLGIB: Multi-Label Graph Information Bottleneck for Expressive and Robust Message Passing — 3 days
- LeanSearch v2: Global Premise Retrieval for Lean 4 Theorem Proving — 3 days
- Inducing Overthink: Hierarchical Genetic Algorithm-based DoS Attack on Black-Box Large Language Reasoning Models — 3 days
- Constitutional Governance in Metric Spaces — 3 days
- Query-Conditioned Test-Time Self-Training for Large Language Models — 3 days
- (How) Do Large Language Models Understand High-Level Message Sequence Charts? — 3 days
- ENSEMBITS: an alphabet of protein conformational ensembles — 3 days
- Higher-order Linear Attention — 3 days
- The Multi-View Paradigm Shift in MRI Radiomics: Predicting MGMT Methylation in Glioblastoma — 3 days
- Procedural Refinement by LLM-driven Algorithmic Debugging for ARC-AGI-2 — 3 days
- Pretraining large language models with MXFP4 on Native FP4 Hardware — 3 days
- @OpenAI: Another reason to switch to Codex. — 3 days
- @simonw: Doing this is a great way to make a bonfire of your reputation — 3 days
- @sama: being a dad is the thing that has most exceeded already-high-expectations in my whole life — 3 days
- NVIDIA/OpenShell — OpenShell is the safe, private runtime for autonomous AI agents. — 3 days
- ton-blockchain/acton — Toolchain for TON smart contract development and beyond — 3 days