AW · AI Watchtower

🔴 High Significance

Model Releases

🔴 Gemma 4 MTP released — score 96 Sources: reddit/r/LocalLLaMA

Blog post:

https://blog.google/innovation-and-ai/technology/developers-tools/multi-token-prediction-gemma-4/

MTP draft models:

https://huggingface.co/google/gemma-4-31B-it-assistant

🔴 Sr Software Engineer - Haven't written a line of code in months — score 83 Sources: reddit/r/AIAgents

AI has reached the point that I no longer write code.

I used to work in shops where I was deep in the debugger without internet access; now I just drive intent and long-term engineering decisions with Claude/Codex/Perplexity. I work at a mid-sized startup with a bit over 100 people.

I just

Developer Tools

🔴 How are you pricing without lighting money on fire? — score 72 Sources: reddit/r/AIAgents

Curious how everyone is pricing their AI agents right now. Are you going outcome based (only paying when the agent actually delivers), flat monthly fees, or straight usage based pricing tied to tokens or actions?

And if you're doing usage based or outcome based, how are you handling the inference c

Infrastructure & Compute

🔴 Struggling to reproduce paper results before improving them — stuck below reported accuracy [R] — score 94 Sources: reddit/r/MachineLearning

I’m a PhD student working in AI/computer vision, and I’ve hit a frustrating wall with a project.

My supervisor asked me to improve the accuracy of a published paper. My first step has been to faithfully reproduce their results before trying any modifications. The issue is I can’t even match their r

🔴 Accelerating Gemma 4: faster inference with multi-token prediction drafters — score 81 Sources: hackernews

🔴 cheahjs/free-llm-api-resources — A list of free LLM inference resources accessible via API. — score 71 Sources: github_trending

Research Papers

🔴 PatRe: A Full-Stage Office Action and Rebuttal Generation Benchmark for Patent Examination — score 85 Sources: huggingface

Patent examination is a complex, multi-stage process requiring both technical expertise and legal reasoning, increasingly challenged by rising application volumes. Prior benchmarks predominantly view patent examination as discriminative classification or static extraction, failing to capture its inh

🔴 X2SAM: Any Segmentation in Images and Videos — score 82 Sources: huggingface · arxiv/cs.AI

Multimodal Large Language Models (MLLMs) have demonstrated strong image-level visual understanding and reasoning, yet their pixel-level perception across both images and videos remains limited. Foundation segmentation models such as the SAM series produce high-quality masks, but they rely on low-lev

Other Signals

🔴 DeepSeek V4 being 17x cheaper got me to actually measure what I send to cloud vs what I could run locally. the results are stupid. — score 88 Sources: reddit/r/LocalLLaMA

That foodtruck bench post showing DeepSeek V4 matching GPT-5.2 at 17x cheaper got me thinking. If frontier cloud models are that overpriced for equivalent quality, how much of my daily work even needs cloud at all?

Ran my normal coding workflow for 10 days. Every task got logged: what it was, token

🔴 Heretic 1.3 released: Reproducible models, integrated benchmarking system, reduced peak VRAM usage, broader model support, and more — score 81 Sources: reddit/r/LocalLLaMA

Dear fellow Llamas, it is my distinct pleasure to announce the immediate availability of version 1.3 of Heretic (https://github.com/p-e-w/heretic), the leading software for removing censorship from language models.

This was a long and eventful release cycle, during which Heretic became a high-p

🔴 NeurIPS Submission Number [D] — score 81 Sources: reddit/r/MachineLearning

Hey guys,

Just saw that NeurIPS submissions this year might exceed 40k. What submission number did you get? The highest I know of was 29k, and that was 24 hours ago.

🟡 Notable

Model Releases

🟡 **[@xai: Grok 4.3 is now live on the xAI API. It’s our fastest, most intelligent model to date.](https://x.com/xai/status/2051703217697010103)** — score 60 Sources: twitter_rss

Grok 4.3 is now live on the xAI API. It’s our fastest, most intelligent model to date.

It tops the @ArtificialAnlys leaderboards in agentic tool calling and instruction following, and ranks #1 in @ValsAI enterprise domains like case law and corporate finance.

Grok 4.3 supports a 1 million token co

🟡 **[@OpenAI: GPT-5.5 Instant is starting to roll out in ChatGPT.](https://x.com/OpenAI/status/2051709028250915275)** — score 50 Sources: twitter_rss

GPT-5.5 Instant is starting to roll out in ChatGPT.

It’s a big upgrade, giving you smarter, clearer, and more personalized answers in a warmer, more natural tone.

And it's also more concise, which we heard you wanted. We think you'll love chatting with it.

🟡 MTP on strix halo with llama.cpp (PR #22673) — score 42 Sources: reddit/r/LocalLLaMA

I saw a post about incoming MTP support in llama.cpp, so I tried it out on an AI Max 395 with 128GB DDR5 8000:
I rebuilt the radv container from https://github.com/kyuz0/amd-strix-halo-toolboxes with that PR : [https://github.com/ggml-org/llama.cp

Developer Tools

🟡 Agents can now create Cloudflare accounts, buy domains, and deploy — score 69 Sources: hackernews

🟡 bytedance/deer-flow — An open-source long-horizon SuperAgent harness that researches, codes, and creates. With the help of sandboxes, memories, tools, skill, subagents and message gateway, it handles different levels of tasks that could take minutes to hours. — score 68 Sources: github_trending

🟡 Arindam200/awesome-ai-apps — A collection of projects showcasing RAG, agents, workflows, and other AI use cases — score 62 Sources: github_trending

🟡 ProgramBench: Can we really rebuild huge binaries from scratch? (doesn't look like it) — score 58 Sources: reddit/r/LocalLLaMA

There have been quite a few case studies recently on agents building whole programs from scratch, but most of them test a single project, or just a few, with hand-tuned setups.

We've spent the last couple of months formalizing this setting and building a benchmark of 200 tasks while doubling down on te

🟡 Prompt evals are not enough once an agent starts taking actions — score 56 Sources: reddit/r/AIAgents

One thing I keep running into with AI agents is that testing the prompt is only a small part of the problem.

An agent can give a decent response in a simple test and still break once it has to move through a real workflow.

The weird failures usually show up when it has to:

remember context acro

Omitted 4 additional developer tools items from the main section; see raw data and source-specific sections below.

Research Papers

🟡 StateSMix: Online Lossless Compression via Mamba State Space Models and Sparse N-gram Context Mixing — score 58 Sources: huggingface · arxiv/cs.LG

We present StateSMix, a fully self-contained lossless compressor that couples an online-trained Mamba-style State Space Model (SSM) with sparse n-gram context mixing and arithmetic coding. The model is initialised from scratch and trained token-by-token on the file being compressed, requiring no pre
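The idea of a model that starts from scratch and learns from the file as it is being compressed can be illustrated with a much simpler stand-in. The sketch below is not StateSMix: it swaps the Mamba-style SSM and n-gram mixing for a plain order-0 adaptive byte-frequency model, and instead of running an arithmetic coder it only sums the ideal code length that such a coder would approach. The function name and the add-one prior are assumptions made for illustration.

```python
import math


def adaptive_code_length_bits(data: bytes) -> float:
    """Ideal code length, in bits, under an order-0 model learned online.

    Every byte is predicted from the counts seen so far, charged
    -log2(probability), and only then added to the counts. A decoder can
    replay the same updates, so no model has to be stored with the file.
    """
    counts = [1] * 256  # add-one prior: every byte value starts equally likely
    total = 256
    bits = 0.0
    for byte in data:
        bits += -math.log2(counts[byte] / total)  # cost of coding this byte
        counts[byte] += 1                         # online update after coding
        total += 1
    return bits


if __name__ == "__main__":
    sample = b"abracadabra" * 50
    print(f"raw size:      {8 * len(sample)} bits")
    print(f"adaptive cost: {adaptive_code_length_bits(sample):.0f} bits")
```

The first bytes are expensive because the model knows nothing yet; the cost per byte falls as the counts adapt, which is the same trade-off an online-trained neural predictor faces at larger scale.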

🟡 The TTS-STT Flywheel: Synthetic Entity-Dense Audio Closes the Indic ASR Gap Where Commercial and Open-Source Systems Fail — score 45 Sources: huggingface

Niche-domain Indic ASR -- digit strings, currency amounts, addresses, brand names, English/Indic codemix -- is under-served by both open-source SOTA and commercial systems. On a synthesised entity-dense Telugu test set (held-out by synthesis system), vasista22/whisper-telugu-large-v2 (open SOTA) ach

🟡 ESARBench: A Benchmark for Agentic UAV Embodied Search and Rescue — score 45 Sources: huggingface

The rapid advancement of Multimodal Large Language Models (MLLMs) has empowered Unmanned Aerial Vehicle (UAV) with exceptional capabilities in spatial reasoning, semantic understanding, and complex decision-making, making them inherently suited for UAV Search and Rescue (SAR). However, existing UAV

Other Signals

🟡 Transformers with Selective Access to Early Representations [R] — score 69 Sources: reddit/r/MachineLearning · arxiv/cs.LG

Hello everyone. I’m excited to share our new paper!

Figure 1: Comparison Across Architectures

A lot of recent Transformer variants try to improve information flow acr

🟡 Quality comparison between Qwen 3.6 27B quantizations (BF16, Q8_0, Q6_K, Q5_K_XL, Q4_K_XL, IQ4_XS, IQ3_XXS,...) — score 65 Sources: reddit/r/LocalLLaMA

The following is a non-comprehensive test I came up with to measure the quality difference (a.k.a. degradation) between different quantizations of Qwen 3.6 27B. I want to figure out which quant is best to run on my 16 GB VRAM setup.

WHAT WE ARE TESTING

First, the prompt:

Given this PGN stri

🟡 Production AI very different from the demos [D] — score 56 Sources: reddit/r/MachineLearning

Moved an AI feature into production a few months ago and the cost profile has been a constant surprise since. The demos and the early prototypes ran cheap because the volume was tiny and the prompts were short, but when it hit traffic the token usage scaled a lot. I think it was partly because custom

🟡 Dense Model Shoot-Off: Gemma 4 31B vs Qwen3.6/5 27B... Result is Slower is Faster. — score 50 Sources: reddit/r/LocalLLaMA

Not affiliated with Kaitchup, but a fan of their testing. I was looking forward to this article... and it did not disappoint. Lots of free info in the link. The juicy part is behind a paywall. I'll respect that, but the short of it is:

It shows that the Qwens are more benchmaxxed, and Ge

🟢 Incremental

Model Releases

🟢 Tired of copy-pasting prompts between Claude and Codex tabs: built a small file-backed queue that automates the handoff — score 33 Sources: reddit/r/AIAgents

I've been working on agent-lanes
https://github.com/leo-diehl/agent-lanes

A small Python tool that lets one AI coding agent hand work to another over a shared folder. The queue is just JSON files on disk: no daemon, no server, no network.

Think o
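A queue made of JSON files in a shared folder, with no daemon or server, can be sketched in a few lines. The code below is a hypothetical illustration of that pattern, not the agent-lanes implementation: the `pending`/`claimed` folder layout and the `enqueue` and `claim_next` names are assumptions. It relies on `os.replace`, which renames atomically when source and destination are on the same filesystem, so a reader never sees a half-written task and two workers cannot both claim the same file.

```python
import json
import os
import tempfile
import time
import uuid
from pathlib import Path
from typing import Optional, Tuple


def enqueue(queue_dir: Path, task: dict) -> Path:
    """Write a task as one JSON file in queue_dir/pending (hypothetical layout)."""
    pending = queue_dir / "pending"
    pending.mkdir(parents=True, exist_ok=True)
    # Write to a temp file first, then rename, so readers never see partial JSON.
    fd, tmp_name = tempfile.mkstemp(dir=queue_dir, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(task, handle)
    # Timestamp prefix gives a rough arrival order; the UUID avoids name clashes.
    final_path = pending / f"{time.time_ns():020d}-{uuid.uuid4().hex}.json"
    os.replace(tmp_name, final_path)
    return final_path


def claim_next(queue_dir: Path) -> Optional[Tuple[Path, dict]]:
    """Move the oldest pending task into queue_dir/claimed and return it."""
    pending = queue_dir / "pending"
    claimed = queue_dir / "claimed"
    claimed.mkdir(parents=True, exist_ok=True)
    if not pending.is_dir():
        return None
    for candidate in sorted(pending.glob("*.json")):
        target = claimed / candidate.name
        try:
            os.replace(candidate, target)  # only one worker can win this rename
        except FileNotFoundError:
            continue  # another worker claimed it first; try the next file
        return target, json.loads(target.read_text(encoding="utf-8"))
    return None


if __name__ == "__main__":
    root = Path(tempfile.mkdtemp())
    enqueue(root, {"from": "agent-a", "to": "agent-b", "prompt": "review the diff"})
    print(claim_next(root))
```

The rename-to-claim step is the design choice that removes the need for a lock server: the filesystem itself arbitrates which agent gets the task.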

🟢 Bleeding Llama: Critical Unauthenticated Memory Leak in Ollama — score 27 Sources: reddit/r/LocalLLaMA

🟢 Qwen 3.6 27B MTP on v100 32GB: 54 t/s — score 19 Sources: reddit/r/LocalLLaMA

Just a quick note that I got a nice result using am17an's MTP branch of llama.cpp on a V100 32GB SXM module using one of those PCIe card adapters. Pulled and built in one shot, and llama-server ran without a hitch.

Tested using am17an's MTP GGUF, q8_0 kv cache and 200k cache limit acting as vscode

🟢 Solidity LM surpasses Opus — score 4 Sources: reddit/r/LocalLLaMA

My weekend project overran a little but happy with the end result.

soleval pass@1 beat Opus 4.7 on the same set of tasks. Some more work to be done here but any feedback is welcome, I spent quite a lot of time (and money) on this one!

https://huggingface.co/samscrack/Qwen3.6-Solidity-27B

Developer Tools

🟢 Question about PLS-DA hyperparameter tuning [R] — score 38 Sources: reddit/r/MachineLearning

Hi all! I am a bioinformatician and I am working on learning some ML tools for some disease/biomarker stuff. I am working with sparse PLS-DA at the moment. Before actually tuning the model, I ran an overall global model (without sparsity) to get an idea of what my data looks like and to get to a sta

🟢 PriorLabs/TabPFN — ⚡ TabPFN: Foundation Model for Tabular Data ⚡ — score 38 Sources: github_trending

🟢 Early attempt at tracking agent work across the economy — score 33 Sources: reddit/r/AIAgents

I made an Agent Economy tracker and would love feedback!

It’s an early attempt to track how agent work could show up across the economy: agent GDP, deployed agent employment, revenue, stack costs, and productivity.

Curious what people here think, especially if you’re already using agents seriously

🟢 GLM-5V-Turbo: Toward a Native Foundation Model for Multimodal Agents — score 31 Sources: hackernews

🟢 Show HN: Airbyte Agents – context for agents across multiple data sources — score 19 Sources: hackernews

Omitted 2 additional developer tools items from the main section; see raw data and source-specific sections below.

Infrastructure & Compute

🟢 TritonSigmoid: A fast, padding-aware sigmoid attention kernel for GPUs [R] — score 18 Sources: reddit/r/MachineLearning

We are open-sourcing TritonSigmoid — a fast, padding-aware sigmoid attention kernel for GPUs.

We built this for single-cell foundation models, where every cell is represented as a sequence of genes. A single gene can be regulated by multiple transcription factors at once. Softmax forces them to com
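The contrast between softmax and sigmoid attention can be made concrete with a toy example. This is a pure-Python sketch, not the TritonSigmoid kernel: it handles one query's scores against four keys, treats the last key as padding, and omits any bias term or scaling that a real kernel might apply. The function names and the score values are made up for illustration.

```python
import math


def softmax_weights(scores, mask):
    """Softmax over unpadded keys: weights sum to 1, so keys compete."""
    peak = max(s for s, keep in zip(scores, mask) if keep)
    exps = [math.exp(s - peak) if keep else 0.0 for s, keep in zip(scores, mask)]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid_weights(scores, mask):
    """Elementwise sigmoid: each key is weighted independently of the others."""
    return [1.0 / (1.0 + math.exp(-s)) if keep else 0.0
            for s, keep in zip(scores, mask)]


if __name__ == "__main__":
    # Two keys match the query strongly, one weakly, and the last is padding.
    scores = [4.0, 4.0, -3.0, 0.0]
    mask = [True, True, True, False]
    print("softmax:", [round(w, 3) for w in softmax_weights(scores, mask)])
    print("sigmoid:", [round(w, 3) for w in sigmoid_weights(scores, mask)])
```

With softmax, the two strong keys have to split the total weight between them; with sigmoid, both can receive a weight close to 1 at the same time, which fits the case of several regulators acting on one gene. In both variants the padded position gets exactly zero weight.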

🟢 Competition - League of Robot Runners 2026: Multi-robot coordination under uncertainty [N] — score 6 Sources: reddit/r/MachineLearning

Hello ML and RL community

We are inviting participants to the League of Robot Runners (LoRR) 2026: https://www.leagueofrobotrunners.org

Co-located with AAMAS 2026, LoRR is a research competition on large-scale multi-robot coordination. These are important pr

Research Papers

🟢 Skills-Coach: A Self-Evolving Skill Optimizer via Training-Free GRPO — score 10 Sources: huggingface

We introduce Skills-Coach, a novel automated framework designed to significantly enhance the self-evolution of skills within Large Language Model (LLM)-based agents. Addressing the current fragmentation of the skill ecosystem, Skills-Coach explores the boundaries of skill capabilities, thereby facil

Other Signals

🟢 What do you use Gemma 4 for? — score 35 Sources: reddit/r/LocalLLaMA

Both Gemma 4 and Qwen 3.6 seem to be the hottest local models right now. Looking at the benchmarks and reviews, Qwen seems better in every way: coding, benchmarks, agentic tasks. So is Qwen outright better? In what case would you pick Gemma over Qwen?

🟢 How to get from vibe-coding to compounding revenue growth using AI agents for GTM — free session with ThriveStack + Brevo — score 26 Sources: reddit/r/AIAgents

Live Webinar

Register to join: [Registration Link](https://app.livestorm.co/brevo/thrivestack-x-brevo-vibe-ai-driven-playbook?utm_source=reddit&utm_medium=p

🟢 Radar Engineer to Autonomy/AI [D] — score 19 Sources: reddit/r/MachineLearning

Hi all, I’ve spent the last 3 years working on Radar Perception for a legacy automotive project in Germany. My background is an MSc in Robotics & AI. Currently, I spend my time analyzing point clouds and SNR distributions to debug failures. It’s mathematically complex, but I’m not implementing a

🟢 Wiki Builder: Skill to Build LLM Knowledge Bases — score 6 Sources: hackernews

🟢 What if AI agents can now talk? — score 0 Sources: reddit/r/AIAgents

Quick context: I use Claude Code and Codex daily and noticed I was spending half my "agent is working" time just sitting there watching the screen. I was like, what if Claude or Codex can just narrate its process back to me, so I know what it's doing?

So I built Heard. Open-source.

What it does:

📈 Trending Repos

Repo	Description	Stars Today	Language
cheahjs/free-llm-api-resources	A list of free LLM inference resources accessible via API.	344	python
bytedance/deer-flow	An open-source long-horizon SuperAgent harness that researches, codes, and creates. With the help of sandboxes, memories, tools, skill, subagents and message gateway, it handles different levels of tasks that could take minutes to hours.	328	python
Arindam200/awesome-ai-apps	A collection of projects showcasing RAG, agents, workflows, and other AI use cases	211	python
vercel-labs/agent-browser	Browser automation CLI for AI agents	117	rust
vercel-labs/ai-cli	Generate anything from your terminal	80	typescript
PriorLabs/TabPFN	⚡ TabPFN: Foundation Model for Tabular Data ⚡	57	python
tensorzero/tensorzero	TensorZero is an open-source LLMOps platform that unifies an LLM gateway, observability, evaluation, optimization, and experimentation.	8	rust

📄 New Papers

Title	Category	Hotness	Link
PatRe: A Full-Stage Office Action and Rebuttal Generation Benchmark for Patent Examination	research_paper	5	Open
X2SAM: Any Segmentation in Images and Videos	research_paper	16	Open
StateSMix: Online Lossless Compression via Mamba State Space Models and Sparse N-gram Context Mixing	research_paper	2	Open
AI Agents for Sustainable SMEs: A Green ESG Assessment Framework	cs.AI	0	Open
ClinicBot: A Guideline-Grounded Clinical Chatbot with Prioritized Evidence RAG and Verifiable Citations	cs.AI	0	Open
Effect-Transparent Governance for AI Workflow Architectures: Semantic Preservation, Expressive Minimality, and Decidability Boundaries	cs.AI	0	Open
Algebraic Semantics of Governed Execution: Monoidal Categories, Effect Algebras, and Coterminous Boundaries	cs.AI	0	Open
A Knowledge-Driven LLM-Based Decision-Support System for Explainable Defect Analysis and Mitigation Guidance in Laser Powder Bed Fusion	cs.AI	0	Open
Towards Multi-Agent Autonomous Reasoning in Hydrodynamics	cs.AI	0	Open
New Bounds for Zarankiewicz Numbers via Reinforced LLM Evolutionary Search	cs.AI	0	Open
PERSA: Reinforcement Learning for Professor-Style Personalized Feedback with LLMs	cs.AI	0	Open
Iterative Finetuning is Mostly Idempotent	cs.AI	0	Open
To Use AI as Dice of Possibilities with Timing Computation	cs.AI	0	Open
A Low-Latency Fraud Detection Layer for Detecting Adversarial Interaction Patterns in LLM-Powered Agents	cs.AI	0	Open
Position: Safety and Fairness in Agentic AI Depend on Interaction Topology, Not on Model Scale or Alignment	cs.AI	0	Open

🐦 Twitter/X Highlights

Account	Tweet Summary
xai	Grok 4.3 is now live on the xAI API. It’s our fastest, most intelligent model to date. It tops the @ArtificialAnlys leaderboards in agentic tool calling and instruction following, and ranks #1 in @ValsAI enterprise domains like case law and corporate finance. Grok 4.3 supports a 1 million token context window and is priced at $1.25/m input and $2.50/m output. Create an API key and start building: http://console.x.ai/team/default/api-keys
OpenAI	GPT-5.5 Instant is starting to roll out in ChatGPT. It’s a big upgrade, giving you smarter, clearer, and more personalized answers in a warmer, more natural tone. And it's also more concise, which we heard you wanted. We think you'll love chatting with it.
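The quoted Grok 4.3 prices translate into per-request cost with simple arithmetic. The sketch below assumes that "$1.25/m" and "$2.50/m" mean US dollars per million input and output tokens respectively; the function name and the example token counts are hypothetical.

```python
def request_cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_price_per_million: float = 1.25,   # assumed: USD per 1M input tokens
    output_price_per_million: float = 2.50,  # assumed: USD per 1M output tokens
) -> float:
    """Cost of one request under simple per-token pricing."""
    total = (input_tokens * input_price_per_million
             + output_tokens * output_price_per_million)
    return total / 1_000_000


if __name__ == "__main__":
    # A long-context call: 200k tokens in, 10k tokens out.
    print(f"${request_cost_usd(200_000, 10_000):.3f}")
```

Under these assumptions the 200k-in, 10k-out request costs $0.25 for input plus $0.025 for output, so input tokens dominate the bill for long-context workloads even though output tokens are priced at twice the rate.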

Repeated From Recent Briefings

Hmbown/DeepSeek-TUI — Coding agent for DeepSeek models that runs in your terminal - first seen 2026-05-02; reason: canonical_url
ruvnet/ruflo — 🌊 The leading agent orchestration platform for Claude. Deploy intelligent multi-agent swarms, coordinate autonomous workflows, and build conversational AI systems. Features enterprise-grade architecture, self-learning swarm intelligence, RAG integration, and native Claude Code / Codex Integration - first seen 2026-05-02; reason: canonical_url
TauricResearch/TradingAgents — TradingAgents: Multi-Agents LLM Financial Trading Framework - first seen 2026-05-02; reason: canonical_url
Most people don’t need agents. They need cleaner workflows. - first seen 2026-05-05; reason: canonical_url
Google Chrome silently installs a 4 GB AI model on your device without consent - first seen 2026-05-05; reason: canonical_url
rtk-ai/rtk — CLI proxy that reduces LLM token consumption by 60-90% on common dev commands. Single Rust binary, zero dependencies - first seen 2026-05-05; reason: canonical_url
AIDC-AI/Pixelle-Video — 🚀 AI 全自动短视频引擎 | AI Fully Automated Short Video Engine - first seen 2026-05-03; reason: canonical_url
virattt/dexter — An autonomous agent for deep financial research - first seen 2026-05-03; reason: canonical_url
raullenchai/Rapid-MLX — The fastest local AI engine for Apple Silicon. 4.2x faster than Ollama, 0.08s cached TTFT, 100% tool calling. 17 tool parsers, prompt cache, reasoning separation, cloud routing. Drop-in OpenAI replacement. Works with Claude Code, Cursor, Aider. - first seen 2026-05-05; reason: canonical_url
cocoindex-io/cocoindex — Incremental engine for long horizon agents 🌟 Star if you like it! - first seen 2026-05-03; reason: canonical_url
... plus 458 more repeated items in processed data

AI Watchtower Briefing — 2026-05-06
