AW · AI Watchtower

🔴 High Significance

Model Releases

🔴 TextGen is now a native desktop app. Open-source alternative to LM Studio (formerly text-generation-webui). — score 94 Sources: reddit/r/LocalLLaMA

Hi all, I have been making a lot of updates to my project, and I wanted to share them here. TextGen (previously text-generation-webui, also known as my username oobabooga or ooba) has been in development since December 2022, before LLaMa and llama.cpp existed. In the last two months, the project has

🔴 DELIGHT – self-hosted AI engineering autopilot: local LLM + browser farm + repo graph + P2P compute — score 94 Sources: reddit/r/AIAgents

DELIGHT – self-hosted AI engineering autopilot: local LLM + browser farm + repo graph + P2P compute TL;DR: Built a local "OS for AI agents" that scans your entire repo into a live graph (Worm), routes tasks between local Qwen, headless ChatGPT browser sessions via Tor/antidetect, and OpenRou

🔴 Claude for Small Business — score 70 Sources: hackernews

Developer Tools

🔴 Human-level performance via ML was not proven impossible with complexity theory [D] — score 94 Sources: reddit/r/MachineLearning

Van Rooij, Guest, de Haan, Adolfi, Kolokolova, and Rich claimed to have proven that AGI via ML is impossible in Computational Brain & Behavior in 2024. The basic idea was to try to reduce a known NP-hard problem to the problem of

🔴 Feels like building AI apps is becoming infrastructure engineering — score 81 Sources: reddit/r/AIAgents

I started experimenting with AI apps because it felt fast and exciting. Now every workflow somehow involves frameworks, vector DBs, orchestration, observability, memory systems, evals, and constant debugging. Wondering if others feel the same lately.

🔴 we really all are going to make it, aren't we? 2x3090 setup. — score 72 Sources: reddit/r/LocalLLaMA

i'm blown away. i saw someone made a post the other day about "club-3090" and after having sonnet patch some fixes into it, specifically a sse-session drop bug and a bug with tool-calling, it's fair to say that even "budget" setups like myself will have a path forward soon for only-local-ai. referen

Enterprise Adoption

🔴 Web-Search is coming to a screeching performance halt as Google shuts down their free search index, and traffic defenders like Cloudflare challenge AI at every gateway. What are our options? — score 83 Sources: reddit/r/LocalLLaMA

Google is closing its free tier to just 50 domains for site-specific search, and an inheritance date of January 1st, 2027, with no public pricing being listed for advanced searches. Cloudflare's new site-default is to challenge all AI bots attempting to scrape web-information for all their customers

Research Papers

🔴 FrameSkip: Learning from Fewer but More Informative Frames in VLA Training — score 75 Sources: huggingface

Vision-Language-Action (VLA) policies are commonly trained from dense robot demonstration trajectories, often collected through teleoperation, by sampling every recorded frame as if it provided equally useful supervision. We argue that this convention creates a temporal supervision imbalance: long l

Other Signals

🔴 Built Support Vector Machine(SVM) from scratch in Rust [P] — score 81 Sources: reddit/r/MachineLearning

Built my own SVM classifier from scratch in Rust. It uses SMO optimization, supports linear and RBF kernels, and uses grid search to tune the hyperparameters. I tested it on two datasets, one linear and the other RBF; these were the results: |Dataset|Kernel|Accuracy|Recall|F1| |:-|:-|:-|:-|:-|
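
For readers unfamiliar with the two kernels the post mentions, here is a minimal Python sketch of what a linear and an RBF kernel compute. This is not the author's Rust code; the function names and the default `gamma` value are illustrative assumptions.

```python
import math


def linear_kernel(x, y):
    """Linear kernel: the plain dot product of two feature vectors."""
    return sum(a * b for a, b in zip(x, y))


def rbf_kernel(x, y, gamma=0.5):
    """RBF kernel: exp(-gamma * ||x - y||^2).

    gamma is a hyperparameter (hypothetical default here); it is the kind
    of value a grid search would typically tune.
    """
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)
```

The RBF kernel equals 1 for identical points and decays toward 0 as points move apart, which is what lets an SVM separate data that is not linearly separable.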

🟡 Notable

Developer Tools

🟡 Most AI-generated apps are complete slop. Controversial take: it’s not AI’s fault — score 69 Sources: reddit/r/AIAgents

AI gets blamed for making boring products, but I think that’s backwards. The problem isn’t that AI can’t build. The problem is that we continually hand it dead ideas. “Build me a productivity app.” “Build me a habit tracker.” “Build me a dashboard for small businesses.” Of course the output feels ge

🟡 opendatalab/MinerU — Transforms complex documents like PDFs and Office docs into LLM-ready markdown/JSON for your Agentic workflows. — score 65 Sources: github_trending

Transforms complex documents like PDFs and Office docs into LLM-ready markdown/JSON for your Agentic workflows.

🟡 K-Dense-AI/scientific-agent-skills — A set of ready to use Agent Skills for research, science, engineering, analysis, finance and writing. — score 63 Sources: github_trending

A set of ready to use Agent Skills for research, science, engineering, analysis, finance and writing.

🟡 openai/whisper — Robust Speech Recognition via Large-Scale Weak Supervision — score 56 Sources: github_trending

Robust Speech Recognition via Large-Scale Weak Supervision

🟡 Building a safe, effective sandbox to enable Codex on Windows — score 50 Sources: lab_blog/OpenAI

Learn how OpenAI built a secure sandbox for Codex on Windows, enabling safe, efficient coding agents with controlled file access and network restrictions.

Infrastructure & Compute

🟡 Trained transformer-based chess models to play like humans (including thinking time) [P] — score 44 Sources: reddit/r/MachineLearning

I trained a set of deep learning (transformer-based) chess models to play like humans (inspired by MAIA and Grandmaster Chess Without Search). There's a separate model for each 100-point rating bucket from ~800 to 2500+. I started with training a mid-strength model from scratch on an 8xH100 cluster,

Research Papers

🟡 RealICU: Do LLM Agents Understand Long-Context ICU Data? A Benchmark Beyond Behavior Imitation — score 58 Sources: huggingface · arxiv/cs.AI

Intensive care units (ICU) generate long, dense and evolving streams of clinical information, where physicians must repeatedly reassess patient states under time pressure, underscoring a clear need for reliable AI decision support. Existing ICU benchmarks typically treat historical clinician actions

🟡 Offline Preference Optimization for Rectified Flow with Noise-Tracked Pairs — score 55 Sources: huggingface

Existing preference datasets for text-to-image models typically store only the final winner/loser images. This representation is insufficient for rectified flow (RF) models, whose generation is naturally indexed by a specific prior noise sample and follows a nearly straight denoising trajectory. In

🟡 Vividh-ASR: A Complexity-Tiered Benchmark and Optimization Dynamics for Robust Indic Speech Recognition — score 52 Sources: huggingface · arxiv/cs.AI

Fine-tuning multilingual ASR models like Whisper for low-resource languages often improves read speech but degrades spontaneous audio performance, a phenomenon we term studio-bias. To diagnose this mismatch, we introduce Vividh-ASR, a complexity-stratified benchmark for Hindi and Malayalam across fo

🟡 PersonalAI 2.0: Enhancing knowledge graph traversal/retrieval with planning mechanism for Personalized LLM Agents — score 42 Sources: huggingface · arxiv/cs.CL

We introduce PersonalAI 2.0 (PAI-2), a novel framework, designed to enhance large language model (LLM) based systems through integration of external knowledge graphs (KG). The proposed approach addresses key limitations of existing Graph Retrieval-Augmented Generation (GraphRAG) methods by incorpora

🟡 F-GRPO: Factorized Group-Relative Policy Optimization for Unified Candidate Generation and Ranking — score 42 Sources: huggingface · arxiv/cs.LG

Traditional retrieval pipelines optimize utility through stages of candidate retrieval and reranking, where ranking operates over a predefined candidate set. Large Language Models (LLMs) broaden this into a generative process: given a candidate pool, an LLM can generate a subset and order it within

Other Signals

🟡 MI50s Qwen 3.6 27B @52.8 tps TG @1569 tps PP (no MTP, no Quant) — score 61 Sources: reddit/r/LocalLLaMA

TL;DR: Results in the title are for single inference with 2 prompts of 1k and 15k tokens. So no MTP (as it’s slower for big prompts), no DFlash (it works too, but is slower for big prompts), no quant used (full precision wanted), and the results are pretty good for a 2018 card. (Bench has been done with

🟡 Have the "on-hold" durations been getting longer for arXiv submissions? [D] — score 56 Sources: reddit/r/MachineLearning

I have a paper that has been "on-hold" for about 2 weeks now. I understand that it might take a little longer now because of inundation of AI generated low-effort papers but my papers have gone from "on-hold" to "submitted" within a couple of days in the past. Wondering if anyone else is facing the

🟡 24+ tok/s from ~30B MoE models on an old GTX 1080 (8 GB VRAM, 128k context) — score 50 Sources: reddit/r/LocalLLaMA

I got Qwen 3.6 35B-A3B and Gemma 4 26B-A4B running on a $200 secondhand machine (i7-6700 / GTX 1080 / 32 GB RAM) using llama.cpp (the TurboQuant/RotorQuant KV cache quantisation allows 128k context within the 8 GB VRAM). Results (Q4_K_M models, 128k context): |Model|tok/s|Key flags| |:

🟡 The US is winning the AI race where it matters most: commercialization — score 50 Sources: hackernews

🟡 @swyx: if your reaction to this is “haha openclaw bad, see prompt injection is the #1 danger” you: 1) havent sufficiently appreciated the layers to this tweet 2) havent seen enough ai api keys — score 50 Sources: twitter_rss

if your reaction to this is “haha openclaw bad, see prompt injection is the #1 danger” you: 1) havent sufficiently appreciated the layers to this tweet 2) havent seen enough ai api keys

🟢 Incremental

Model Releases

🟢 What revenue model would you guys suggest for our automation orchestration platform open to public agents as a marketplace? — score 38 Sources: reddit/r/AIAgents

This is not a promotion. We are looking for suggestions. So we are very close to launching an automation orchestration platform where any developer can list their agent based on platform specifications to perform any specific task that can be used as a building block of a larger flow by anyone. The

🟢 A Claude Code and Codex Skill for Deliberate Skill Development — score 10 Sources: hackernews

🟢 Simpler self hosted alt to Open WebUI — score 8 Sources: reddit/r/LocalLLaMA

Got Qwen3.6 27B running on my newly assembled 4x 3090 rig (s/o 3090-club) and I'm trying to get the people in my house to adopt the local workflow. Open WebUI has improved a lot in the recent updates, but I still found it pretty rough for non-technical people. It often feels more like a dev tool tha

🟢 The "the future is fictional" problem of many local LLMs — score 6 Sources: reddit/r/LocalLLaMA

Many local models have a problem (that arose from excessive RLHF training): They mostly think that everything beyond their knowledge cutoff date is "fictional" or "satirical". To be fair: Even the Gemini API without web access can have this sometimes. But it stops when you give it t

Developer Tools

🟢 Side Projects. — score 39 Sources: reddit/r/LocalLLaMA

Little something I put together to play with for larger contexts than my 9070xt. 8700k, dual P100's, 16gb DDR4, 32gb Optane, Samsung sata SSD. Nothing too fancy. Anyone else do a recent build? How's it working out?

🟢 Spent weeks debugging my agent in Langchain before realizing the framework was the problem. — score 38 Sources: reddit/r/AIAgents

Spent way too long thinking complexity in my agent was a me problem. Bad prompts, bad memory setup, bad tool definitions. Kept tweaking Langchain configs trying to fix behavior I couldn't even properly observe. Turns out half the problem was I had no idea what was actually happening under the hood.

🟢 TraceMind – open source LLM quality monitoring with a ReAct agent that investigates why your AI started giving wrong answers — score 38 Sources: reddit/r/AIAgents

Background: I was building a multi-agent system. Changed one line in a system prompt. Quality dropped from 84% to 52% pass rate. HTTP 200 the whole time. Found out 11 days later from a user. That incident made me realize LLM apps have a monitoring gap that doesn't exist in traditional software. When
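
The monitoring gap described above (quality falling while every request still returns HTTP 200) can be illustrated with a small pass-rate check. This is a minimal Python sketch under stated assumptions, not TraceMind's implementation: `pass_rate`, `quality_regressed`, and the `max_drop` threshold are hypothetical names and values.

```python
def pass_rate(results):
    """Fraction of eval cases that passed; results is a list of booleans."""
    if not results:
        raise ValueError("no eval results to score")
    return sum(1 for passed in results if passed) / len(results)


def quality_regressed(baseline, current, max_drop=0.05):
    """Flag a regression when the pass rate falls by more than max_drop.

    max_drop is a hypothetical tolerance; a real system would tune it.
    """
    return (baseline - current) > max_drop
```

Applied to the figures in the post, `quality_regressed(0.84, 0.52)` flags the drop, even though a status-code monitor would report nothing wrong.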

🟢 Local services data is the biggest gap for AI agents. Am I wrong? — score 38 Sources: reddit/r/AIAgents

I've been building agents that need to interact with the real, physical world; things like "find me a plumber available tomorrow under $80/hr" or "compare 3 electricians near me." And I keep hitting the same wall: this data simply doesn't exist in structured form. * Pricing? Buried in a 2015 PDF on

🟢 NVIDIA/OpenShell — OpenShell is the safe, private runtime for autonomous AI agents. — score 37 Sources: github_trending

OpenShell is the safe, private runtime for autonomous AI agents.

Omitted 3 additional developer tools items from the main section; see raw data and source-specific sections below.

Infrastructure & Compute

🟢 ansible/ansible — Ansible is a radically simple IT automation platform that makes your applications and systems easier to deploy and maintain. Automate everything from code deployment to network configuration to cloud management, in a language that approaches plain English, using SSH, with no agents to install on remote systems. https://docs.ansible.com — score 13 Sources: github_trending

Ansible is a radically simple IT automation platform that makes your applications and systems easier to deploy and maintain. Automate everything from code deployment to network configuration to cloud management, in a language that approaches plain English, using SSH, with no agents to install on rem

🟢 huggingface/pytorch-image-models — The largest collection of PyTorch image encoders / backbones. Including train, eval, inference, export scripts, and pretrained weights -- ResNet, ResNeXT, EfficientNet, NFNet, Vision Transformer (ViT), MobileNetV4, MobileNet-V3 & V2, RegNet, DPN, CSPNet, Swin Transformer, MaxViT, CoAtNet, ConvNeXt, and more — score 3 Sources: github_trending

The largest collection of PyTorch image encoders / backbones. Including train, eval, inference, export scripts, and pretrained weights -- ResNet, ResNeXT, EfficientNet, NFNet, Vision Transformer (ViT), MobileNetV4, MobileNet-V3 & V2, RegNet, DPN, CSPNet, Swin Transformer, MaxViT, CoAtNet, ConvNeXt,

Business & Funding

🟢 Rare event prediction on time series that change structure mid-stream? [D] — score 0 Sources: reddit/r/MachineLearning

Hi reddit! I made this post on r/MLQuestions, but I am posting it here too for spread:) This is a case I have been assigned at work and I'd love input from anyone who's tackled something similar. I'm building a failure prediction model for ~33k chargers. The devices emit data at two very different

Research Papers

🟢 From Pixels to Concepts: Do Segmentation Models Understand What They Segment? — score 15 Sources: huggingface

Segmentation is a fundamental vision task underlying numerous downstream applications. Recent promptable segmentation models, such as Segment Anything Model 3 (SAM3), extend segmentation from category-agnostic mask prediction to concept-guided localization conditioned on high-level textual prompts.

Other Signals

🟢 Arena AI Model ELO History — score 30 Sources: hackernews

🟢 Anyone actually using a local LLM as their daily knowledge base? Not for coding, for life stuff. What's your setup? — score 17 Sources: reddit/r/LocalLLaMA

So I've been going down a rabbit hole lately and I can't find many people actually talking about this specific use case. everyone here runs local LLMs for coding, chat, maybe some creative writing. cool. But what about using it as a proper personal knowledge base? like, dump your own notes, PDFs, ra

🟢 GPT 5.5 v/s GPT 5.4. Paying 63% more just for 0.1 point difference! — score 6 Sources: reddit/r/AIAgents

was running cost comparisons on codex models this week and kept assuming gpt-5.5 would justify the premium because it benchmarks highest. the thing i keep noticing is that raw benchmark scores and cost-adjusted scores are almost completely disconnected, and people treat them like they're the same nu

📈 Trending Repos

Repo	Description	Stars Today	Language
opendatalab/MinerU	Transforms complex documents like PDFs and Office docs into LLM-ready markdown/JSON for your Agentic workflows.	129	python
K-Dense-AI/scientific-agent-skills	A set of ready to use Agent Skills for research, science, engineering, analysis, finance and writing.	99	python
openai/whisper	Robust Speech Recognition via Large-Scale Weak Supervision	68	python
NVIDIA/OpenShell	OpenShell is the safe, private runtime for autonomous AI agents.	38	rust
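ErlichLiu/Proma	Brings the smoothest general-purpose Agent experience into your workflow. A product of the future built for 100x professional users, now realizing the proactive Agent stage. A complete open-source practice based on the Claude Agent SDK, with native support for invocation from Feishu group chats and flexible integration with any LLM provider, so that top-tier Agent capabilities truly run where you work every day.	35	typescript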
EleutherAI/lm-evaluation-harness	A framework for few-shot evaluation of language models.	22	python
ansible/ansible	Ansible is a radically simple IT automation platform that makes your applications and systems easier to deploy and maintain. Automate everything from code deployment to network configuration to cloud management, in a language that approaches plain English, using SSH, with no agents to install on remote systems. https://docs.ansible.com	18	python
huggingface/pytorch-image-models	The largest collection of PyTorch image encoders / backbones. Including train, eval, inference, export scripts, and pretrained weights -- ResNet, ResNeXT, EfficientNet, NFNet, Vision Transformer (ViT), MobileNetV4, MobileNet-V3 & V2, RegNet, DPN, CSPNet, Swin Transformer, MaxViT, CoAtNet, ConvNeXt, and more	8	python

📄 New Papers

Title	Category	Hotness	Link
FrameSkip: Learning from Fewer but More Informative Frames in VLA Training	research_paper	19	Open
RealICU: Do LLM Agents Understand Long-Context ICU Data? A Benchmark Beyond Behavior Imitation	research_paper	3	Open
Offline Preference Optimization for Rectified Flow with Noise-Tracked Pairs	research_paper	7	Open
Vividh-ASR: A Complexity-Tiered Benchmark and Optimization Dynamics for Robust Indic Speech Recognition	research_paper	2	Open
Think Twice, Act Once: Verifier-Guided Action Selection For Embodied Agents	cs.AI	0	Open
Macro-Action Based Multi-Agent Instruction Following through Value Cancellation	cs.AI	0	Open
Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack	cs.AI	0	Open
Revealing Interpretable Failure Modes of VLMs	cs.AI	0	Open
Learning Transferable Latent User Preferences for Human-Aligned Decision Making	cs.AI	0	Open
On the Size Complexity and Decidability of First-Order Progression	cs.AI	0	Open
DisaBench: A Participatory Evaluation Framework for Disability Harms in Language Models	cs.AI	0	Open
CHAL: Council of Hierarchical Agentic Language	cs.AI	0	Open
BEHAVE: A Hybrid AI Framework for Real-Time Modeling of Collective Human Dynamics	cs.AI	0	Open
State-Centric Decision Process	cs.AI	0	Open
PROMETHEUS: Automating Deep Causal Research Integrating Text, Data and Models	cs.AI	0	Open

🏢 Lab Blog Posts

OpenAI: Building a safe, effective sandbox to enable Codex on Windows

🐦 Twitter/X Highlights

Account	Tweet Summary
swyx	if your reaction to this is “haha openclaw bad, see prompt injection is the #1 danger” you: 1) havent sufficiently appreciated the layers to this tweet 2) havent seen enough ai api keys

Repeated From Recent Briefings

NousResearch/hermes-agent — The agent that grows with you - first seen 2026-05-11
MulTaBench: Benchmarking Multimodal Tabular Learning with Text and Image - first seen 2026-05-12
tinyhumansai/openhuman — Your Personal AI super intelligence. Private, Simple and extremely powerful. - first seen 2026-05-11
rohitg00/agentmemory — #1 Persistent memory for AI coding agents based on real-world benchmarks - first seen 2026-05-09
farion1231/cc-switch — A cross-platform desktop All-in-One assistant for Claude Code, Codex, OpenCode, OpenClaw, Gemini CLI & Hermes Agent. Only official website: ccswitch.io - first seen 2026-05-08
Show HN: Needle: We Distilled Gemini Tool Calling into a 26M Model - first seen 2026-05-13
garrytan/gstack — Use Garry Tan's exact Claude Code setup: 23 opinionated tools that serve as CEO, Designer, Eng Manager, Release Manager, Doc Engineer, and QA - first seen 2026-05-12
yikart/AiToEarn — Let's use AI to Earn! - first seen 2026-05-11
Predicting Decisions of AI Agents from Limited Interaction through Text-Tabular Modeling - first seen 2026-05-13
anthropics/skills — Public repository for Agent Skills - first seen 2026-05-11
... plus 117 more repeated items in processed data

AI Watchtower Briefing — 2026-05-14
