AW · AI Watchtower

🔴 High Significance

Model Releases

🔴 Opencode you naughty minx — score 96 Sources: reddit/r/LocalLLaMA

Man, AI agents getting pretty crazy these days. :) (local, I just decided to try to get an orchestrator in there, when Qwen and Gemma aren't up to it.)

🔴 I've shipped 3 products this year. None of them have users. Here's my problem. — score 79 Sources: reddit/r/AIAgents

I keep shipping products then stalling right before marketing. Anyone else break this pattern? I've noticed a recurring issue in my own work: I can build, design, and ship a product all the way to launch-ready — but when it's time to actually activate (cold outreach, Reddit posts, cold email sequenc

🔴 Dynamically allocating compute budget to hard set of problems and evolving the sections with Qwen-35B-A3B gets you near GPT-5.4-xHigh on HLE — score 71 Sources: reddit/r/LocalLLaMA

Developer Tools

🔴 I’m done paying for LLMs until they learn token efficiency — score 93 Sources: reddit/r/AIAgents

Watching my agent spin in circles, re-explaining the same steps over and over before maybe doing something useful. I keep telling it to use the “/caveman full” skill (short, direct, no fluff) and it just ignores me. More verbose walls of text. More wasted tokens. Why aren’t these models trained for age

🔴 joeseesun/qiaomu-anything-to-notebooklm — Claude Skill: Multi-source content processor for NotebookLM. Supports WeChat articles, web pages, YouTube, PDF, Markdown, search queries → Podcast/PPT/MindMap/Quiz etc. — score 79 Sources: github_trending

Infrastructure & Compute

🔴 internlm/Intern-S2-Preview · Hugging Face — score 79 Sources: reddit/r/LocalLLaMA

We introduce Intern-S2-Preview, an efficient 35B scientific multimodal foundation model. Beyond conventional parameter and data scaling, Intern-S2-Preview explores task scaling: increasing the difficulty, diversity, and coverage of scientific tasks to further unlock model

Other Signals

🔴 I believe there are entire companies right now under AI psychosis — score 90 Sources: hackernews

🔴 Backlash against Arxiv's proposed 1 year ban is genuinely perplexing. [D] — score 81 Sources: reddit/r/MachineLearning

Anyone else surprised at the enormous amount of backlash against Arxiv's proposed 1-year ban for authors and coauthors publishing papers with hallucinated references and other obvious LLM/Gen AI artifacts? https://x.com/tdietterich/status/2055000956144935055

🔴 Show HN: Watch a neural net learn to play Snake — score 70 Sources: hackernews

🟡 Notable

Model Releases

🟡 Orthrus-Qwen3-8B : up to 7.8×tokens/forward on Qwen3-8B, frozen backbone, provably identical output distribution — score 69 Sources: reddit/r/LocalLLaMA · hackernews

Code: https://github.com/chiennv2000/orthrus * Paper: https://arxiv.org/abs/2605.12825 * HF: https://huggingface.co/chiennv/Orthrus-Qwen3-1.7B ; [https://huggingface.c

🟡 KDD 2026 Cycle 2 Results [D] — score 69 Sources: reddit/r/MachineLearning

Results for the research track have been released.

🟡 [FOUNDING] SupraLabs - real open-source AI models for you! — score 54 Sources: reddit/r/LocalLLaMA

Hey r/LocalLLaMA! We founded SupraLabs, and it's huge! What we do? We train, finetune and explore small models with good results to revolutionize small AI models by

🟡 @xai: You can now use your @grok subscription inside @NousResearch Hermes Agent. http://x.ai/news/grok-hermes — score 50 Sources: twitter_rss

Developer Tools

🟡 Your AI agent says "transferring you to a human" and then... nothing happens. Here's the pattern that actually fixes this. — score 64 Sources: reddit/r/AIAgents

I made a YouTube video about the most common failure point I see in WhatsApp AI deployments, and it's almost never discussed. Would love to share the topic and read your thoughts on the subject. The bot tells the customer "I'll connect you with a human agent." The customer waits. No one comes. They

🟡 github/awesome-copilot — Community-contributed instructions, agents, skills, and configurations to help you make the most of GitHub Copilot. — score 61 Sources: github_trending

Infrastructure & Compute

🟡 ROCm with PyTorch and PyTorch Lightning seems to still suck for research [D] — score 56 Sources: reddit/r/MachineLearning

So I asked about people's experiences with ROCm in a post a few weeks or so ago https://www.reddit.com/r/MachineLearning/comments/1t6cng3/rocm_status_in_mid_2026_d/ I actually went and procured a RX 7900XTX

🟡 Are the rich RAM /poor GPU people wrong here? — score 46 Sources: reddit/r/LocalLLaMA

Hello guys, I know everyone has their own definition of local models, but for me I see two "reasonable" types of frontier local models: a dense one that barely fits in 32GB or 24GB of GPU memory for the most "reasonable" GPU-wealthy guys, and an MoE in the 100B-parameter range. The 100-ish-billion-parameter models can be run on hybri

Research Papers

🟡 WildTableBench: Benchmarking Multimodal Foundation Models on Table Understanding In the Wild — score 65 Sources: huggingface

Using multimodal foundation models to analyze table images is a high-value yet challenging application in consumer and enterprise scenarios. Despite its importance, current evaluations rely largely on structured-text tables or clean rendered images, leaving the visual complexity of in-the-wild table

🟡 Sat3DGen: Comprehensive Street-Level 3D Scene Generation from Single Satellite Image — score 60 Sources: huggingface · arxiv/cs.AI

Generating a street-level 3D scene from a single satellite image is a crucial yet challenging task. Current methods present a stark trade-off: geometry-colorization models achieve high geometric fidelity but are typically building-focused and lack semantic diversity. In contrast, proxy-based models

🟡 Aligning Latent Geometry for Spherical Flow Matching in Image Generation — score 50 Sources: huggingface

Latent flow matching for image generation usually transports Gaussian noise to variational autoencoder latents along linear paths. Both endpoints, however, concentrate in thin spherical shells, and a Euclidean chord leaves those shells even when preprocessing aligns their radii. By decomposing each

Other Signals

🟡 Qwen3.6-35B-A3B and 9B are officially on the public Terminal-Bench 2.0 leaderboard! — score 62 Sources: reddit/r/LocalLLaMA

Qwen3.6-35B-A3B and 9B are officially on the public Terminal-Bench 2.0 leaderboard! little-coder × Qwen3.6-35B-A3B hit 24.6% (±3.2), and now land above Gemini 2.5 Pro on Gemini CLI (19.6%) and Qwen3-Coder-480B on Terminus 2 (23.9%). I didn’t expect the scaffold-model gap from Polyglot to hold on
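The quoted scores can be sanity-checked with a little arithmetic. The sketch below is only an illustration: it assumes the ±3.2 figure is a symmetric range around the 24.6% score (the excerpt does not say whether it is a standard error or a confidence interval), and the `interval` helper is a hypothetical name.

```python
def interval(score, plus_minus):
    """Return the (low, high) range implied by score ± plus_minus."""
    return (score - plus_minus, score + plus_minus)

# Figures quoted in the post (percent on Terminal-Bench 2.0).
little_coder = 24.6
plus_minus = 3.2  # assumption: treated as a symmetric range
baselines = {
    "Gemini 2.5 Pro on Gemini CLI": 19.6,
    "Qwen3-Coder-480B on Terminus 2": 23.9,
}

low, high = interval(little_coder, plus_minus)
for name, score in baselines.items():
    gap = little_coder - score
    inside = low <= score <= high
    print(f"{name}: gap {gap:+.1f} points, inside range: {inside}")
```

Under that assumption, the 5.0-point lead over Gemini 2.5 Pro falls outside the ±3.2 range, while the 0.7-point lead over Qwen3-Coder-480B falls inside it.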

🟡 Frontier AI has broken the open CTF format — score 50 Sources: hackernews

🟢 Incremental

Model Releases

🟢 Luce Megakernel: Why is nobody talking about this? — score 29 Sources: reddit/r/LocalLLaMA

Everyone has been talking about Luce DFlash and PFlash. I just came across their megakernel, and it seems it was released along with DFlash and PFlash. It seems it's giving them 1.8x greater speed with much better power efficiency on NVIDIA GPUs, comparable to the efficiency achieved on Apple silicon! How's

🟢 Which LLMs are actually best for bleeding-edge Linux/ML debugging workflows in 2026? [R] — score 6 Sources: reddit/r/MachineLearning

I’m trying to optimize an AI workflow for bleeding-edge Linux/ML debugging (Arch/CachyOS, CUDA, Python, unsloth, etc.). Current stack: - Claude = deep reasoning/mastermind - Gemini 3.1 Pro = execution/logistics - Perplexity = retrieval Main problem: Gemini often gives high-friction or impractical

Developer Tools

🟢 One thing that feels underdiscussed in AI right now: — score 29 Sources: reddit/r/AIAgents

We keep treating memory like a UX feature when it’s really becoming an operational problem. The hard part is not “can the model remember stuff?” The hard part is: * what happens when the memory is wrong * how you trace where a belief came from * how stale context gets replaced * how you migrate syst

🟢 i asked 23 companies how they actually test their AI agents before shipping. the answers genuinely scared me. — score 29 Sources: reddit/r/AIAgents

spent the last 3 weeks DMing CS leads, ops managers, and PMs at companies running AI agents in production. just one question: "how do you know your agent works before it goes live?" here's what i found: 17 out of 23 said some version of "we just ship it and watch slack for complaints" 4 used a sprea

🟢 How did you handle the team conversation when you rolled out AI customer support? — score 29 Sources: reddit/r/AIAgents

We're planning to roll out AI-assisted support in Q3. Technically the plan is solid. The part I'm not sure I'm handling well is the team communication. I have 11 support agents. The honest projection is that AI will handle 60–70% of ticket volume over time. I don't plan to do layoffs — the growth pl

🟢 PostHog/posthog — 🦔 PostHog is an all-in-one developer platform for building successful products. We offer product analytics, web analytics, session replay, error tracking, feature flags, experimentation, surveys, data warehouse, a CDP, and an AI product assistant to help debug your code, ship features faster, and keep all your usage and customer data in one stack. — score 21 Sources: github_trending

🟢 What's in a GGUF, besides the weights - and what's still missing? — score 4 Sources: reddit/r/LocalLLaMA

Omitted 1 additional developer tools item from the main section; see raw data and source-specific sections below.

Infrastructure & Compute

🟢 Finding the 4x 3090 Sweet Spot — score 21 Sources: reddit/r/LocalLLaMA

In another post I had someone ask me about the power draw of the 4x 3090 setup, so I'm sharing a full test I conducted to understand the efficiency curve. Used this [blog

🟢 Struggling with Overfitting on Medical Imaging Task [D] — score 19 Sources: reddit/r/MachineLearning

Hi everyone, I’m working on a 2-class classification problem (LCA vs. RCA coronary arteries) using 2D X-ray angiograms. I’m currently stuck in a cycle of extreme overfitting and could use some advice on my training strategy. The Setup: * Dataset: Small (~900 training frames from ~300 unique DICOMs

Other Signals

🟢 Can a 5090 with qwen3.6 achieve > 3,000 tok/s ? bring your pitchforks (open-dllm) — score 38 Sources: reddit/r/LocalLLaMA

Some background: Fred Zhangzhi Peng, Shuibai Zhang, and Alex Tong worked on converting AR -> diffusion (it's already working from older models). https://oval-shell-31c.notion.site/Open-dLLM-Open-Diffusion-Large-Language-Model-25e03bf6136480b7a4ebe3d53be9f68a

🟢 PINN is predicting trivial solution for stiff ODE [D] — score 31 Sources: reddit/r/MachineLearning

I am learning physics-informed neural networks. Currently, I am solving a simple second-order ODE (damped harmonic oscillator). The equation is m*d2y/dt2 + mu*dy/dt + k*y = 0 (initial conditions: y(t=0) = 1, y'(t=0) = 0). I managed to draft some code. The code works for k values up to 50. However, when increased the valu
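For an ODE like the one quoted above, a closed-form reference solution is one way to check what a network predicts. The following is a minimal sketch, not the poster's code: the function names are hypothetical, the values of m and mu are made-up assumptions (the excerpt gives only k), and it covers only the underdamped case.

```python
import math

def damped_oscillator(t, m, mu, k):
    """Closed-form y(t) for m*y'' + mu*y' + k*y = 0 with y(0)=1, y'(0)=0.

    Valid only in the underdamped case (mu**2 < 4*m*k).
    """
    gamma = mu / (2.0 * m)            # decay rate
    omega_sq = k / m - gamma ** 2     # squared oscillation frequency
    if omega_sq <= 0:
        raise ValueError("not underdamped: mu**2 must be < 4*m*k")
    omega = math.sqrt(omega_sq)
    return math.exp(-gamma * t) * (
        math.cos(omega * t) + (gamma / omega) * math.sin(omega * t)
    )

def ode_residual(t, m, mu, k, h=1e-4):
    """Central-difference estimate of m*y'' + mu*y' + k*y at time t."""
    y0 = damped_oscillator(t, m, mu, k)
    yp = damped_oscillator(t + h, m, mu, k)
    ym = damped_oscillator(t - h, m, mu, k)
    dy = (yp - ym) / (2 * h)
    d2y = (yp - 2 * y0 + ym) / (h * h)
    return m * d2y + mu * dy + k * y0

# m and mu are hypothetical; k = 50 is the value mentioned in the post.
m, mu, k = 1.0, 0.5, 50.0
for t in (0.0, 0.5, 1.0, 2.0):
    y = damped_oscillator(t, m, mu, k)
    r = ode_residual(t, m, mu, k)
    print(f"t={t:.1f}  y={y:+.5f}  residual={r:+.2e}")
```

A network's prediction can be compared against `damped_oscillator` on the same time grid. Larger k raises the oscillation frequency sqrt(k/m - (mu/(2m))^2), so the target varies faster over the same time span.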

🟢 AllenAI has been iterating on their MolmoAct2 models for robotics — score 12 Sources: reddit/r/LocalLLaMA

r/AllenAI is cooking with MolmoAct2, a 5B vision-language-action model for robot control. They keep releasing new fine-tunes on different kinds of robotics datasets, including (but not limited to, and they keep releasing new ones): * https://huggingface.co/allenai/MolmoAct2-LIBERO - general robotics

🟢 Someone Shared a Real Monet Painting as AI and Asked for Critiques — score 10 Sources: hackernews

🟢 GetMCP: Zero Trust for AI Agents — score 0 Sources: reddit/r/AIAgents

Just shipped v0.1.0 of something I've been building. Sharing because I haven't seen anyone solve this end-to-end as a self-hostable thing. The problem. AI agents (Claude, ChatGPT, Cursor, in-house bots) are starting to make real calls into production APIs. Most companies are handing them a single lo

📈 Trending Repos

Repo	Description	Stars Today	Language
joeseesun/qiaomu-anything-to-notebooklm	Claude Skill: Multi-source content processor for NotebookLM. Supports WeChat articles, web pages, YouTube, PDF, Markdown, search queries → Podcast/PPT/MindMap/Quiz etc.	438	python
github/awesome-copilot	Community-contributed instructions, agents, skills, and configurations to help you make the most of GitHub Copilot.	105	python
PostHog/posthog	🦔 PostHog is an all-in-one developer platform for building successful products. We offer product analytics, web analytics, session replay, error tracking, feature flags, experimentation, surveys, data warehouse, a CDP, and an AI product assistant to help debug your code, ship features faster, and keep all your usage and customer data in one stack.	24	python
NVIDIA-NeMo/NeMo	A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)	5	python

📄 New Papers

Title	Category	Hotness	Link
WildTableBench: Benchmarking Multimodal Foundation Models on Table Understanding In the Wild	research_paper	8	Open
Sat3DGen: Comprehensive Street-Level 3D Scene Generation from Single Satellite Image	research_paper	4	Open
Aligning Latent Geometry for Spherical Flow Matching in Image Generation	research_paper	4	Open
Mixed Integer Goal Programming for Personalized Meal Optimization with User-Defined Serving Granularity	cs.AI	0	Open
A Two-Dimensional Framework for AI Agent Design Patterns: Cognitive Function and Execution Topology	cs.AI	0	Open
Invisible Orchestrators Suppress Protective Behavior and Dissociate Power-Holders: Safety Risks in Multi-Agent LLM Systems	cs.AI	0	Open
PolitNuggets: Benchmarking Agentic Discovery of Long-Tail Political Facts	cs.AI	0	Open
Conditional Attribute Estimation with Autoregressive Sequence Models	cs.AI	0	Open
Model-Adaptive Tool Necessity Reveals the Knowing-Doing Gap in LLM Tool Use	cs.AI	0	Open
SPIN: Structural LLM Planning via Iterative Navigation for Industrial Tasks	cs.AI	0	Open
Bad Seeing or Bad Thinking? Rewarding Perception for Vision-Language Reasoning	cs.AI	0	Open
SkillFlow: Flow-Driven Recursive Skill Evolution for Agentic Orchestration	cs.AI	0	Open
ChromaFlow: A Negative Ablation Study of Orchestration Overhead in Tool-Augmented Agent Evaluation	cs.AI	0	Open
Modeling Bounded Rationality in Drug Shortage Pharmacists Using Attention-Guided Dynamic Decomposition	cs.AI	0	Open
ClawForge: Generating Executable Interactive Benchmarks for Command-Line Agents	cs.AI	0	Open

🐦 Twitter/X Highlights

Account	Tweet Summary
xai	You can now use your @grok subscription inside @NousResearch Hermes Agent. http://x.ai/news/grok-hermes

Repeated From Recent Briefings

tinyhumansai/openhuman — Your Personal AI super intelligence. Private, Simple and extremely powerful. - first seen 2026-05-11
Learning to Communicate Locally for Large-Scale Multi-Agent Pathfinding - first seen 2026-05-11
arXiv implements 1-year ban for papers containing incontrovertible evidence of unchecked LLM-generated errors, such as hallucinated references or results. [N] - first seen 2026-05-15
garrytan/gstack — Use Garry Tan's exact Claude Code setup: 23 opinionated tools that serve as CEO, Designer, Eng Manager, Release Manager, Doc Engineer, and QA - first seen 2026-05-12
rohitg00/agentmemory — #1 Persistent memory for AI coding agents based on real-world benchmarks - first seen 2026-05-09
anthropics/skills — Public repository for Agent Skills - first seen 2026-05-11
Long Context Pre-Training with Lighthouse Attention - first seen 2026-05-08
K-Dense-AI/scientific-agent-skills — A set of ready to use Agent Skills for research, science, engineering, analysis, finance and writing. - first seen 2026-05-14
shiyu-coder/Kronos — Kronos: A Foundation Model for the Language of Financial Markets - first seen 2026-05-07
ViMU: Benchmarking Video Metaphorical Understanding - first seen 2026-05-15
... plus 289 more repeated items in processed data

AI Watchtower Briefing — 2026-05-16
