AW · AI Watchtower

🔴 High Significance

Model Releases

🔴 Qwen3.6-27B vs Coder-Next — score 88 Sources: reddit/r/LocalLLaMA

Burned about 20 hours of side-by-side compute on my two RTX PRO 6000 Blackwells trying to get a definitive answer on which of these two models was clearly better. As with many things in life, many tokens and kWhs later the answer was "it depends."

These models in the aggregate are actually cr

🔴 DeepClaude – Claude Code agent loop with DeepSeek V4 Pro — score 75 Sources: hackernews

Developer Tools

🔴 TauricResearch/TradingAgents — TradingAgents: Multi-Agents LLM Financial Trading Framework — score 99 Sources: github_trending

🔴 ruvnet/ruflo — 🌊 The leading agent orchestration platform for Claude. Deploy intelligent multi-agent swarms, coordinate autonomous workflows, and build conversational AI systems. Features enterprise-grade architecture, self-learning swarm intelligence, RAG integration, and native Claude Code / Codex Integration — score 97 Sources: github_trending

🔴 One bash permission slipped... — score 96 Sources: reddit/r/LocalLLaMA

How? It kept getting chained bash commands wrong, with wrong escapes. So it created many bad directories, and tried "fixing" its mistake. It offered to run a large bash command, with rm -rf inside, and stupid me missed it.

I'm glad I push everything often. But the disruption is massive.
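One way to catch this class of mistake is to scan a chained command for recursive force-deletes before approving it. The sketch below is an illustration only, not something from the post: the function name `find_recursive_rm` and the separator list are assumptions, and the check is deliberately naive (it splits on `&&`, `||`, `;` and `|` without honoring quotes, and it does not look inside subshells or scripts).

```python
import re
import shlex

# Hypothetical, simplified set of shell chain separators.
CHAIN_SEPARATORS = re.compile(r"&&|\|\||;|\|")


def find_recursive_rm(command: str) -> list[str]:
    """Return the parts of a chained command that run rm with both -r and -f."""
    flagged = []
    for part in CHAIN_SEPARATORS.split(command):
        try:
            tokens = shlex.split(part)
        except ValueError:
            # Unbalanced quotes: the kind of bad escaping worth a manual look.
            flagged.append(part.strip())
            continue
        if tokens and tokens[0] == "sudo":
            tokens = tokens[1:]
        if not tokens or tokens[0] != "rm":
            continue
        flags = [t for t in tokens[1:] if t.startswith("-")]
        short = "".join(t[1:] for t in flags if not t.startswith("--"))
        recursive = "r" in short or "R" in short or "--recursive" in flags
        force = "f" in short or "--force" in flags
        if recursive and force:
            flagged.append(part.strip())
    return flagged


if __name__ == "__main__":
    proposed = "mkdir -p out && cd out && rm -rf ../build; echo done"
    for hit in find_recursive_rm(proposed):
        print("needs explicit confirmation:", hit)
```

A check like this only narrows what a human has to read; it is not a substitute for reviewing the full command or for frequent pushes.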

FAQ:

🔴 Are modern ML PhDs becoming too incremental, or is this just what research looks like now? [D] — score 94 Sources: reddit/r/MachineLearning

I’ve been thinking about the current state of machine learning PhDs, including my own work, and I’d like to hear how others see it.
My impression is that a large fraction of modern ML PhD work follows a fairly predictable pattern: take an existing idea, connect it to another existing idea, apply i

🔴 Why do AI responses get worse after a while of working on them? And what to do with it — score 93 Sources: reddit/r/AIAgents

AIs have a known problem (it's called context rot): the longer the chat, the worse the responses. Even staying on the same topic. The model begins to confuse old decisions with new ones, re-proposes ideas that have already been discarded, loses the thread of what is current and what is not.

It'

Omitted 8 additional developer tools items from the main section; see raw data and source-specific sections below.

Infrastructure & Compute

🔴 AMD Strix Halo refresh with 192gb! — score 81 Sources: reddit/r/LocalLLaMA

Looks like the next Strix Halo, the Gorgon Halo 495 Max, will have more than 128GB! I already bought a Strix Halo mini forms a couple of months ago, since the 2026 refresh rumors were not interesting. I was not planning on getting another until 2027, with the bigger refresh, and linking them together. But was pl

Research Papers

🔴 UniVidX: A Unified Multimodal Framework for Versatile Video Generation via Diffusion Priors — score 95 Sources: huggingface

Recent progress has shown that video diffusion models (VDMs) can be repurposed for diverse multimodal graphics tasks. However, existing methods often train separate models for each problem setting, which fixes the input-output mapping and limits the modeling of correlations across modalities. We pre

🔴 Learning while Deploying: Fleet-Scale Reinforcement Learning for Generalist Robot Policies — score 85 Sources: huggingface

Generalist robot policies increasingly benefit from large-scale pretraining, but offline data alone is insufficient for robust real-world deployment. Deployed robots encounter distribution shifts, long-tail failures, task variations, and human correction opportunities that fixed demonstration datase

🔴 From Skill Text to Skill Structure: The Scheduling-Structural-Logical Representation for Agent Skills — score 75 Sources: huggingface

LLM agents increasingly rely on reusable skills, capability packages that combine instructions, control flow, constraints, and tool calls. In most current agent systems, however, skills are still represented by text-heavy artifacts, including SKILL.md-style documents and structured records whose mac

Other Signals

🔴 A Qwen finetune, that feels VERY human — score 73 Sources: reddit/r/LocalLLaMA

Hello guys,

So TL;DR, I was asked by multiple people to make an Assistant_Pepe_32B version, but the best base model contender was Qwen3-32B, a model that is very hard to tune on anything other than STEM.

The concept of Assistant_Pepe is an assistant without a typical 'assistant brain', that is

🟡 Notable

Model Releases

🟡 What a time to be alive from 1tk/sec to 20-100tk/sec for huge models — score 65 Sources: reddit/r/LocalLLaMA

https://www.reddit.com/r/LocalLLaMA/comments/1eb6to7/llama_405b_q4_k_m_quantization_running_locally/

[https://www.reddit.com/r/LocalLLaMA/comments/1ebbgkr/llama_31_405b_q5_k_m_runnin

🟡 **[@OpenAI: One week since the launch of GPT-5.5, and it’s already our strongest model launch yet.](https://x.com/OpenAI/status/2050250926888468929)** — score 60 Sources: twitter_rss

One week since the launch of GPT-5.5, and it’s already our strongest model launch yet.

API revenue is growing more than 2x faster than any prior release, while Codex doubled revenue in under seven days as enterprise demand for agentic coding tools keeps climbing.

🟡 **[@xai: Voice Cloning is now live via the xAI API!](https://x.com/xai/status/2050355373052223585)** — score 60 Sources: twitter_rss

Voice Cloning is now live via the xAI API!

Create a custom voice in less than 2 minutes or select from our library of 80+ voices across 28 languages to personalize your voice agents, audiobooks, video game characters, and more.

http://x.ai/news/grok-custom-voices

🟡 **[@xai: Introducing Grok Voice Think Fast 1.0](https://x.com/xai/status/2047441173569216721)** — score 60 Sources: twitter_rss

Introducing Grok Voice Think Fast 1.0

A state-of-the-art voice model built for complex, multi-step workflows with snappy responses and high accuracy.

It takes the top spot on the Tau Voice Bench and handles real-world messiness like noise, accents, and interruptions better than any other model in

🟡 **[@MistralAI: 🆕 Today, we're releasing the public preview of Workflows, the orchestration layer for enterprise AI.](https://x.com/MistralAI/status/2049128071874179091)** — score 60 Sources: twitter_rss

🆕 Today, we're releasing the public preview of Workflows, the orchestration layer for enterprise AI.

🌎 Enterprise teams have capable models. What they don't have is a way to run them reliably in production. That's the gap Workflows fills. It takes AI-powered business processes from prototype to pro

Omitted 3 additional model releases items from the main section; see raw data and source-specific sections below.

Developer Tools

🟡 LearningCircuit/local-deep-research — Local Deep Research achieves ~95% on SimpleQA benchmark (tested with Qwen 3.6). Supports local and cloud LLMs (Ollama, Google, Anthropic, ...). Searches 10+ sources - arXiv, PubMed, web, and your private documents. Everything Local & Encrypted. — score 66 Sources: github_trending

🟡 iOfficeAI/AionUi — Free, local, open-source 24/7 Cowork app and OpenClaw for Gemini CLI, Claude Code, Codex, OpenCode, Qwen Code, Goose CLI, Auggie, and more | 🌟 Star if you like it! — score 64 Sources: github_trending

🟡 harvard-edge/cs249r_book — Machine Learning Systems — score 61 Sources: github_trending

🟡 Spent 6 months building one platform that replaces my LLM proxy + agent framework + workflow engine + observability stack - sharing before I keep adding features forever — score 59 Sources: reddit/r/AIAgents

Motivation: I wanted one tool that handles every aspect of building an agent. Didn't want to pay for a stack of products (LiteLLM, n8n, LangSmith, etc.) and didn't want five dashboards, five auth setups, and traces that don't connect across layers. We're already dependent on the model providers

🟡 Q00/ouroboros — Agent OS: Stop prompting. Start specifying. — score 59 Sources: github_trending

Omitted 13 additional developer tools items from the main section; see raw data and source-specific sections below.

Research Papers

🟡 Web2BigTable: A Bi-Level Multi-Agent LLM System for Internet-Scale Information Search and Extraction — score 65 Sources: huggingface

Agentic web search increasingly faces two distinct demands: deep reasoning over a single target, and structured aggregation across many entities and heterogeneous sources. Current systems struggle on both fronts. Breadth-oriented tasks demand schema-aligned outputs with wide coverage and cross-entit

🟡 Online Self-Calibration Against Hallucination in Vision-Language Models — score 55 Sources: huggingface · arxiv/cs.LG

Large Vision-Language Models (LVLMs) often suffer from hallucinations, generating descriptions that include visual details absent from the input image. Recent preference alignment methods typically rely on supervision distilled from stronger models such as GPT. However, this offline paradigm introdu

🟡 LASE: Language-Adversarial Speaker Encoding for Indic Cross-Script Identity Preservation — score 55 Sources: huggingface · arxiv/cs.CL

A speaker encoder used in multilingual voice cloning should treat the same speaker identically regardless of which script the audio was uttered in. Off-the-shelf encoders do not, and the failure is accent-conditional. On a 1043-pair Western-accented voice corpus across English, Hindi, Telugu, and Ta

🟡 Learning to Act and Cooperate for Distributed Black-Box Consensus Optimization — score 40 Sources: huggingface

Distributed blackbox consensus optimization is a fundamental problem in multi-agent systems, where agents must improve a global objective using only local objective queries and limited neighbor communication. Existing methods largely rely on handcrafted update rules and static cooperation patterns,

🟡 AnalogRetriever: Learning Cross-Modal Representations for Analog Circuit Retrieval — score 40 Sources: huggingface

Analog circuit design relies heavily on reusing existing intellectual property (IP), yet searching across heterogeneous representations such as SPICE netlists, schematics, and functional descriptions remains challenging. Existing methods are largely limited to exact matching within a single modality

Omitted 1 additional research papers item from the main section; see raw data and source-specific sections below.

Other Signals

🟡 Anyone submit ML articles to ACM journals (eg. TOPML or TIST)? [D] — score 69 Sources: reddit/r/MachineLearning

Have any of you submitted ML articles to ACM journals (eg. TOPML or TIST)? How long did the process take, and were the reviews high-quality? How does it compare to other journals (eg. TMLR) in terms of difficulty? Thanks.

🟢 Incremental

Model Releases

🟢 Open source models are going to be the future on Cursor, OpenCode etc. — score 35 Sources: reddit/r/LocalLLaMA

I just wanted to share my experience. At work we have Cursor with the Enterprise tier. Today I burned $10 with 2 prompts, one on gpt-5.5 and one on claude-opus-4.6-thinking. Last month I burned $80 in one week with claude-opus-4.7 even with the 50% off they had with the launch. If they continue with

🟢 Excellent discussion about LLM scaling [D] — score 25 Sources: reddit/r/MachineLearning

I came across an excellent in-depth discussion of memory and compute scaling analysis for LLMs. One takeaway is that running LLMs locally or on a private cloud is wasteful. Memory / compute scaling makes large batching during inference very efficient.

Highly recommend. [How GPT, Claude, and Gemini

Developer Tools

🟢 njbrake/agent-of-empires — Manage multiple Claude Code, OpenCode agents from either TUI or Web for easy access on mobile. Also supports Mistral Vibe, Codex CLI, Gemini CLI, Pi.dev, Copilot CLI, Factory Droid Coding. Uses tmux and git worktrees. — score 36 Sources: github_trending

🟢 xingkongliang/skills-manager — A lightweight desktop app to manage, sync, and organize AI agent skills across 15+ coding tools — Cursor, Claude Code, Codex, Copilot, and more. — score 34 Sources: github_trending

🟢 nexu-io/nexu — The simplest desktop client for OpenClaw 🦞 — bridge your Agent to WeChat, Feishu, Slack & Discord in one click. Works with Claude Code, Codex & any LLM. BYOK, Oauth, local-first, chat from your phone 24/7. — score 24 Sources: github_trending

🟢 should agentic systems have models specialized only for code? — score 21 Sources: reddit/r/AIAgents

Most current agents feel like they rely on one big general-purpose model for everything: planning, reasoning, and actually writing code. But coding is a different beast compared to normal text.

What if we had dedicated coding models inside the agent stack? One model trained only for code understand

🟢 Xiaomi mimo coding plan is an absolute scam/misleading marketing — score 21 Sources: reddit/r/AIAgents

They say on their page that the plan is 1.6 billion credits, with mimo v2.5 pro taking 2 credits per token and mimo v2.5 taking 1 credit per token. But here is how they get you: cached tokens are still billed the same credits on every round trip, so it is absolutely not suitable for coding CLIs, because every single one of them by d
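To see why billing cached tokens at the full rate matters for an agent loop, here is a rough arithmetic sketch. The 1.6 billion credit budget and the per-token rates come from the post; everything else is an assumption for illustration: the session shape (a 20,000-token starting context that grows by 2,000 tokens per turn, 500 output tokens per turn, 50 turns), the dictionary keys, and the idea that output tokens are billed at the same per-token rate.

```python
PLAN_CREDITS = 1_600_000_000  # from the post
CREDITS_PER_TOKEN = {"mimo v2.5 pro": 2, "mimo v2.5": 1}  # from the post


def session_credits(model: str, base_context: int, growth_per_turn: int,
                    output_per_turn: int, turns: int) -> int:
    """Credits for one session if the whole context is re-billed every round trip."""
    total_tokens = 0
    for turn in range(turns):
        context = base_context + growth_per_turn * turn  # resent each turn
        total_tokens += context + output_per_turn
    return total_tokens * CREDITS_PER_TOKEN[model]


if __name__ == "__main__":
    # Hypothetical session shape, not taken from the post.
    cost = session_credits("mimo v2.5 pro", 20_000, 2_000, 500, 50)
    print("credits per session:", cost)
    print("sessions per plan:", PLAN_CREDITS // cost)
```

Under these assumed numbers a single 50-turn session re-bills about 3.5 million tokens, so the headline budget covers far fewer sessions than the raw credit count suggests.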

Omitted 3 additional developer tools items from the main section; see raw data and source-specific sections below.

Research Papers

🟢 Talker-T2AV: Joint Talking Audio-Video Generation with Autoregressive Diffusion Modeling — score 10 Sources: huggingface

Joint audio-video generation models have shown that unified generation yields stronger cross-modal coherence than cascaded approaches. However, existing models couple modalities throughout denoising via pervasive attention, treating high-level semantics and low-level details in a fully entangled man

Other Signals

🟢 Mistral-Medium-3.5-128B-Q3_K_M on 3x3090 (72GB VRAM) — score 27 Sources: reddit/r/LocalLLaMA

Here is the actual speed of Mistral Medium Q3 running locally on 3x3090

first some Python

https://preview.redd.it/3blnqya7o0zg1.png?width=1670&format=png&auto=webp&s=bab477f9889c16558044ccebb22e3ebfb6a56118

https://preview.redd.it/76a3j6u7o0zg1.png?width=1620&format=png&auto=w

🟢 Built a LangChain middleware that enforces signed authorization receipts before every tool call. Here is why wrap_tool_call is the right enforcement point. — score 27 Sources: reddit/r/AIAgents

Been building a pre-execution authorization layer for AI agents. The core idea is that a signed delegation receipt needs to exist before any tool call executes. Not a policy. Not a system prompt. A cryptographic constraint the agent cannot reason around.
For LangChain specifically wrap_tool_ca
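The general idea of checking a signed receipt before a tool runs can be sketched with the standard library alone. This is not the author's middleware and does not use any LangChain API: the names `sign_receipt` and `guarded_call` are hypothetical, and an HMAC over the tool name and arguments stands in for whatever signature scheme the real project uses.

```python
import hashlib
import hmac
import json


def sign_receipt(tool_name: str, args: dict, key: bytes) -> str:
    """Issue a receipt bound to one exact tool call (name plus arguments)."""
    payload = json.dumps({"tool": tool_name, "args": args}, sort_keys=True)
    return hmac.new(key, payload.encode("utf-8"), hashlib.sha256).hexdigest()


def guarded_call(tool_fn, tool_name: str, args: dict, receipt, key: bytes):
    """Run tool_fn only if the receipt matches this call; otherwise refuse."""
    expected = sign_receipt(tool_name, args, key)
    if receipt is None or not hmac.compare_digest(expected, receipt):
        raise PermissionError(f"no valid receipt for tool call: {tool_name}")
    return tool_fn(**args)


if __name__ == "__main__":
    key = b"demo-key-held-outside-the-agent"  # hypothetical key

    def add(a: int, b: int) -> int:
        return a + b

    receipt = sign_receipt("add", {"a": 1, "b": 2}, key)
    print(guarded_call(add, "add", {"a": 1, "b": 2}, receipt, key))
```

Because the receipt covers the arguments, changing them after authorization invalidates it; the key has to live somewhere the agent cannot read for the check to mean anything.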

🟢 UAI Reviews disappeared [D] — score 25 Sources: reddit/r/MachineLearning

Did everyone else’s reviews disappear on their submissions?

🟢 OpenAI’s o1 correctly diagnosed 67% of ER patients vs. 50-55% by triage doctors — score 25 Sources: hackernews

🟢 Where should you invest in the coming 25 years? Check the screenshot — score 21 Sources: reddit/r/AIAgents

While building for the future at Layout.dev (acquired by Incorta), I wondered which areas I should invest in for my kids' development so that they would have good opportunities in their professional careers.

So I went to [https://layout.dev](h

Omitted 3 additional other signals items from the main section; see raw data and source-specific sections below.

📈 Trending Repos

Repo	Description	Stars Today	Language
TauricResearch/TradingAgents	TradingAgents: Multi-Agents LLM Financial Trading Framework	3313	python
ruvnet/ruflo	🌊 The leading agent orchestration platform for Claude. Deploy intelligent multi-agent swarms, coordinate autonomous workflows, and build conversational AI systems. Features enterprise-grade architecture, self-learning swarm intelligence, RAG integration, and native Claude Code / Codex Integration	1840	typescript
1jehuang/jcode	Coding Agent Harness	591	rust
AIDC-AI/Pixelle-Video	🚀 AI 全自动短视频引擎 | AI Fully Automated Short Video Engine	497	python
firecrawl/firecrawl	🔥 The API to search, scrape, and interact with the web for AI	462	typescript
virattt/dexter	An autonomous agent for deep financial research	418	typescript
Hmbown/DeepSeek-TUI	Coding agent for DeepSeek models that runs in your terminal	343	rust
czlonkowski/n8n-mcp	A MCP for Claude Desktop / Claude Code / Windsurf / Cursor to build n8n workflows for you	282	typescript
cocoindex-io/cocoindex	Incremental engine for long horizon agents 🌟 Star if you like it!	163	python
LearningCircuit/local-deep-research	Local Deep Research achieves ~95% on SimpleQA benchmark (tested with Qwen 3.6). Supports local and cloud LLMs (Ollama, Google, Anthropic, ...). Searches 10+ sources - arXiv, PubMed, web, and your private documents. Everything Local & Encrypted.	143	python

📄 New Papers

Title	Category	Hotness	Link
UniVidX: A Unified Multimodal Framework for Versatile Video Generation via Diffusion Priors	research_paper	61	Open
Learning while Deploying: Fleet-Scale Reinforcement Learning for Generalist Robot Policies	research_paper	7	Open
From Skill Text to Skill Structure: The Scheduling-Structural-Logical Representation for Agent Skills	research_paper	6	Open
Web2BigTable: A Bi-Level Multi-Agent LLM System for Internet-Scale Information Search and Extraction	research_paper	3	Open
Online Self-Calibration Against Hallucination in Vision-Language Models	research_paper	2	Open
LASE: Language-Adversarial Speaker Encoding for Indic Cross-Script Identity Preservation	research_paper	2	Open
TADI: Tool-Augmented Drilling Intelligence via Agentic LLM Orchestration over Heterogeneous Wellsite Data	cs.AI	0	Open
AgentReputation: A Decentralized Agentic AI Reputation Framework	cs.AI	0	Open
Minimal, Local, Causal Explanations for Jailbreak Success in Large Language Models	cs.AI	0	Open
Are Tools All We Need? Unveiling the Tool-Use Tax in LLM Agents	cs.AI	0	Open
TUR-DPO: Topology- and Uncertainty-Aware Direct Preference Optimization	cs.AI	0	Open
ARMOR 2025: A Military-Aligned Benchmark for Evaluating Large Language Model Safety Beyond Civilian Contexts	cs.AI	0	Open
Causal Foundations of Collective Agency	cs.AI	0	Open
Agentic AI for Trip Planning Optimization Application	cs.AI	0	Open
Token Arena: A Continuous Benchmark Unifying Energy and Cognition in AI Inference	cs.AI	0	Open

🐦 Twitter/X Highlights

Account	Tweet Summary
OpenAI	One week since the launch of GPT-5.5, and it’s already our strongest model launch yet. API revenue is growing more than 2x faster than any prior release, while Codex doubled revenue in under seven days as enterprise demand for agentic coding tools keeps climbing. Post
xai	Voice Cloning is now live via the xAI API! Create a custom voice in less than 2 minutes or select from our library of 80+ voices across 28 languages to personalize your voice agents, audiobooks, video game characters, and more. http://x.ai/news/grok-custom-voices Post
xai	Introducing Grok Voice Think Fast 1.0 A state-of-the-art voice model built for complex, multi-step workflows with snappy responses and high accuracy. It takes the top spot on the Tau Voice Bench and handles real-world messiness like noise, accents, and interruptions better than any other model in the world. https://x.ai/news/grok-voice-think-fast-1 Post
MistralAI	🆕 Today, we're releasing the public preview of Workflows, the orchestration layer for enterprise AI. 🌎 Enterprise teams have capable models. What they don't have is a way to run them reliably in production. That's the gap Workflows fills. It takes AI-powered business processes from prototype to production, with the durability, observability, and fault tolerance that production actually requires. Leading organisations like ASML, ABANCA, CMA-CGM, France Travail, La Banque Postale, Moeve, and many others are already using Workflows to automate critical processes. Post
OpenAI	Bring your workflow to Codex in just a few clicks. Import settings, plugins, agents, project configuration, and more so you can keep working with fewer interruptions. Your move. Post
MistralAI	Mistral AI made the TIME100 Most Influential Companies list for 2026 — and the top 10 for AI. Why we're proud: customers run frontier models in production on their own terms, on their own infrastructure. Thank you to our customers for their trust and for joining us on the journey. Grateful to our incredible team members around the world and congrats to all the businesses recognized this year. Learn more at: https://time.com/collection/time100-most-influential-companies/2026/mistral/ #TIME100Companies #TIME100CompaniesIndustryLeader Post
karpathy	Fireside chat at Sequoia Ascent 2026 from a ~week ago. Some highlights: The first theme I tried to push on is that LLMs are about a lot more than just speeding up what existed before (e.g. coding). Three examples of new horizons: 1. menugen: an app that can be fully engulfed by LLMs, with no classical code needed: input an image, output an image and an LLM can natively do the thing. 2. install .md skills instead of install .sh scripts. Why create a complex Software 1.0 bash script for e.g. installing a piece of software if you can write the installation out in words and say "just show this to your LLM". The LLM is an advanced interpreter of English and can intelligently target installation to your setup, debug everything inline, etc. 3. LLM knowledge bases as an example of something that was impossible with classical code because it's computation over unstructured data (knowledge) from arbitrary sources and in arbitrary formats, including simply text articles etc. I pushed on these because in every new paradigm change, the obvious things are always in the realm of speeding up or somehow improving what existed, but here we have examples of functionality that either suddenly perhaps shouldn't even exist (1,2), or was fundamentally not possible before (3). The second (ongoing) theme is trying to explain the pattern of jaggedness in LLMs. How it can be true that a single artifact will simultaneously 1) coherently refactor a 100,000-line code base and 2) tell you to walk to the car wash to wash your car. I previously wrote about the source of this as having to do with verifiability of a domain, here I expand on this as having to also do with economics because revenue/TAM dictates what the frontier labs choose to package into training data distributions during RL. You're either in the data distribution (on the rails of the RL circuits) and flying or you're off-roading in the jungle with a machete, in relative terms. Still not 100% satisfied with this, but it's an ongoing struggle to build an accurate model of LLM capabilities if you wish to practically take advantage of their power while avoiding their pitfalls, which brings me to... Last theme is the agent-native economy. The decomposition of products and services into sensors, actuators and logic (split up across all of 1.0/2.0/3.0 computing paradigms), how we can make information maximally legible to LLMs, some words on the quickly emerging agentic engineering and its skill set, related hiring practices, etc., possibly even hints/dreams of fully neural computing handling the vast majority of computation with some help from (classical) CPU coprocessors. Post
simonw	I released LLM 0.32a0 this morning, a major backwards-compatible refactor of my LLM Python library and CLI tool for working with language models - the new changes should help LLM work better with reasoning models and other new frontier capabilities https://simonwillison.net/2026/Apr/29/llm/ Post

AI Watchtower Briefing — 2026-05-04
