AW · AI Watchtower

🔴 High Significance

Model Releases

🔴 Introducing Gemma 4 12B: a unified, encoder-free multimodal model — score 90 Sources: reddit/r/LocalLLaMA · hackernews

🔴 Here is this month's experimentation: Grocery Agent — score 79 Sources: reddit/r/AIAgents

The idea is simple: A human can send grocery instructions in natural language on WhatsApp, and the rest of the system takes over. For example, the user does not need to open an app, search for products, compare prices, or manually build a cart. They can just say what they need in human

Business & Funding

🔴 Uber's $1,500/month AI limit is a useful signal for AI tool pricing — score 81 Sources: hackernews

Research Papers

🔴 OVO-S-Bench: A Hierarchical Benchmark for Streaming Spatial Intelligence in Multimodal LLMs — score 95 Sources: huggingface

Multimodal agents in robotics, AR, and autonomous driving must reason about places and layouts from continuous egocentric streams, often using evidence outside the current view. Existing benchmarks either evaluate offline over full videos or target events rather than spatial structure. We introduce

🔴 Reproducing, Analyzing, and Detecting Reward Hacking in Rubric-Based Reinforcement Learning — score 72 Sources: huggingface · arxiv/cs.AI

Rubric-based reinforcement learning (RL) uses an LLM-as-a-Judge (LaaJ) to score model outputs according to rubrics as rewards. However, policy models may exploit latent biases in the judge, leading to reward hacking and ineffective or unsafe training outcomes. In real-world rubric-based RL, such hac

Other Signals

🔴 google/gemma-4-12B · Hugging Face — score 96 Sources: reddit/r/LocalLLaMA

Gemma is a family of open models built by Google DeepMind. Gemma 4 models are multimodal, handling text and image input (with audio supported on E2B, E4B, and 12B) and generating text output. This release includes open-weights models in both pre-trained and instruction-tuned variants. Gemma 4 featur

🔴 Me visiting this sub — score 88 Sources: reddit/r/LocalLLaMA

🔴 More Gemma 4 models incoming — score 73 Sources: reddit/r/LocalLLaMA

https://x.com/i/status/2062237998415069224 possibly the 120B model

🟡 Notable

Model Releases

🟡 @OpenAI: We’re bringing new capabilities to GPT-Rosalind, a model series purpose-built for life sciences research at enterprise scale. It brings GPT-5.5’s agentic coding and tool use together with stronger in — score 60 Sources: twitter_rss

We’re bringing new capabilities to GPT-Rosalind, a model series purpose-built for life sciences research at enterprise scale. It brings GPT-5.5’s agentic coding and tool use together with stronger intelligence for drug discovery, analysis, design, and experimental workflows. https://openai.com/index

🟡 How Endava is redesigning software delivery around AI agents — score 50 Sources: lab_blog/OpenAI

Learn how Endava is using AI agents, ChatGPT Enterprise, and Codex to accelerate software delivery, automate workflows, and build an AI-native culture across the enterprise.

🟡 Introducing new capabilities to GPT-Rosalind — score 50 Sources: lab_blog/OpenAI

GPT-Rosalind advances life sciences research with enhanced biological reasoning, medicinal chemistry expertise, genomics analysis, and experimental workflow capabilities.

🟡 How Wasmer used Codex to build a Node.js runtime for the edge — score 50 Sources: lab_blog/OpenAI

See how Wasmer used Codex with GPT-5.5 to build a Node.js runtime for the edge, accelerating development 10x to 20x and shipping in weeks instead of months.

🟡 @xai: Try Grok models on @Cloudflare's AI Gateway! — score 50 Sources: twitter_rss


Omitted 2 additional model releases items from the main section; see raw data and source-specific sections below.

Developer Tools

🟡 Analysis of AlphaZero training data [D] — score 69 Sources: reddit/r/MachineLearning

I am trying to train an AlphaZero model for Othello on a 6x6-board. Having been warned that too little exploration during data generation can lead to models being overconfident and trapped in some tight region of the search tree, I started with the value c_puct = 4.0, and then reduced this to 3.5 a
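
For context, here is a minimal sketch of the PUCT selection rule that c_puct scales, in the form described in the AlphaGo Zero paper: each action is scored as Q + c_puct * P * sqrt(total visits) / (1 + N). The function names and the example numbers are illustrative assumptions, not taken from the post.

```python
import math


def puct_scores(priors, visit_counts, q_values, c_puct=4.0):
    """Return Q + U for every action at one node (AlphaZero-style PUCT)."""
    # Exploration bonus grows with the parent's total visit count...
    sqrt_total = math.sqrt(sum(visit_counts))
    return [
        # ...and shrinks as the individual action accumulates visits.
        q + c_puct * p * sqrt_total / (1 + n)
        for p, n, q in zip(priors, visit_counts, q_values)
    ]


def select_action(priors, visit_counts, q_values, c_puct=4.0):
    """Pick the index of the action with the highest PUCT score."""
    scores = puct_scores(priors, visit_counts, q_values, c_puct)
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical node statistics for three legal moves.
priors = [0.5, 0.3, 0.2]
visits = [10, 2, 0]
values = [0.2, 0.1, 0.0]

explore = select_action(priors, visits, values, c_puct=4.0)  # favors the unvisited move
exploit = select_action(priors, visits, values, c_puct=0.1)  # favors the best Q so far
```

With a large c_puct the unvisited move wins on its exploration bonus; with a small one the search sticks to the move with the highest value estimate, which is the overconfidence risk the post mentions.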

🟡 what broke first when your ai agent got real tool access? for us it wasn't the model — score 57 Sources: reddit/r/AIAgents

The first thing that broke for us wasn't reasoning, it was tool ambiguity. Once the agent could touch real systems, the model mostly did what you'd expect. The messy part was that tools looked obvious to us and weirdly interchangeable to the agent. Two actions with similar names, slightly differe

🟡 @simonw: Uber reportedly now caps coding agents at $1,500/month per employee per tool - seems sensible to me, but it's also an interesting hint at the value Uber thinks these tools are providing https://simonw — score 50 Sources: twitter_rss

Uber reportedly now caps coding agents at $1,500/month per employee per tool - seems sensible to me, but it's also an interesting hint at the value Uber thinks these tools are providing https://simonwillison.net/2026/Jun/3/uber-caps-usage/

🟡 interviewstreet/hiring-agent — AI agent to evaluate and score resumes. — score 49 Sources: github_trending


Business & Funding

🟡 Perplexity is STEALING from users, violating Law and hiding behind their AI bots Sam — score 57 Sources: reddit/r/AIAgents

This is not about the money. It’s about the principle. We are constantly told that AI is here to "help" us, but multi-million dollar companies like Perplexity are weaponizing their own AI to steal from regular users, stonewall our complaints, and blatantly violate consumer rights. It is systemic co

Research Papers

🟡 Unlocking Feature Learning in Gated Delta Networks at Scale — score 50 Sources: huggingface · arxiv/cs.AI

Training and scaling Large Language Models demand enormous computational resources, motivating both efficient sub-quadratic architectures and principled hyperparameter tuning methods. While the Maximal Update Parametrization (μP) has enabled zero-shot hyperparameter transfer for standard Transformer

🟡 STRIDE: Training Data Attribution via Sparse Recovery from Subset Perturbations — score 50 Sources: huggingface · arxiv/cs.CL

Training Data Attribution (TDA) seeks to trace a model's predictions back to its training data. The gold standard for TDA relies on causal interventions, observing how a model changes when data is added or removed, but repeated retraining is computationally challenging for Large Language Models (LLM

🟡 Deep Embedded Multiplicative DMD for Algebra-Preserving Koopman Learning — score 40 Sources: huggingface · arxiv/cs.LG

Koopman theory turns nonlinear dynamics into a linear spectral problem. In computation, however, everything depends on a hard finite-dimensional choice: the observables must be expressive, nearly invariant under the dynamics, and, ideally, compatible with composition. Deep Koopman methods learn flex

Other Signals

🟡 Artificial intelligence is not conscious – Ted Chiang — score 69 Sources: hackernews

🟡 NeurIPS used uncalibrated AI detector for desk rejections [D] — score 64 Sources: reddit/r/MachineLearning

I recently had a submission desk-rejected from the NeurIPS 2026 Position Paper Track for an alleged AI-policy violation. After corresponding with the track leadership and reading their public blog post, I think the broader methodological issue is worth discussing here. The track used Pangram, a prop

🟡 Let us let Google know that we want the Gemma 4 124b — score 58 Sources: reddit/r/LocalLLaMA

Gemma 4 is good, great even but it's missing that one last step from being Legendary. Let us make noise and let Google know that we want the 124b Gemma 4 variant - please let them know: https://huggingface.co/google/gemma-4-12B-it/discussions

🟡 First paper acceptance (ICML Workshop), should I attend? [D] — score 56 Sources: reddit/r/MachineLearning

I just finished my first year of undergrad, and I got my first first-author paper accepted to an ICML workshop! Super stoked, especially since I was lowk a crashout in high school. I wanted to know if it is worth it for me to go? It's quite expensive, and I will be the only one in my lab in attendanc

🟡 Failing grades soar with AI usage, dwindling math skills in Berkeley CS classes — score 56 Sources: hackernews

Omitted 3 additional other signals items from the main section; see raw data and source-specific sections below.

🟢 Incremental

Model Releases

🟢 Trump signs narrower executive order on AI oversight after industry objections — score 19 Sources: reddit/r/LocalLLaMA

https://techcrunch.com/2026/06/02/trump-signs-narrower-executive-order-on-ai-oversight-after-industry-objections/ I presume open weight US models that are considered "powerful" will n

🟢 The ways we contain Claude across products — score 19 Sources: hackernews

🟢 Best Visual Reasoning Model in 2026 (Including APIs) [D] — score 12 Sources: reddit/r/MachineLearning

For example, suppose I have a one-hour video and I provide it to ChatGPT or another AI model. If I ask complex reasoning questions about the video, which models are best suited for long-horizon video understanding and reasoning? Which models can produce the most reliable answers in this scenario?

🟢 Gemma 4 QAT confirmed to release soon! — score 4 Sources: reddit/r/LocalLLaMA

It seems like this comment has gone widely unnoticed. https://old.reddit.com/r/LocalLLaMA/comments/1tvtn6m/googlegemma412b_hugging_face/opjj681/ Maybe hold off on testing quantization and wait for its re

Developer Tools

🟢 Repo for implementations of various Transformer Attn mechanisms [P] — score 38 Sources: reddit/r/MachineLearning

Initially, I developed this so I can easily switch between different Attention mechanisms for my Small Language Model (SLM) experiments and benchmarking. However, I also realized that these implementations can be applicable in Computer Vision, modernize Vision Encoders, RL, and others. I hope this h

🟢 0x4m4/hexstrike-ai — HexStrike AI MCP Agents is an advanced MCP server that lets AI agents (Claude, GPT, Copilot, etc.) autonomously run 150+ cybersecurity tools for automated pentesting, vulnerability discovery, bug bounty automation, and security research. Seamlessly bridge LLMs with real-world offensive security capabilities. — score 28 Sources: github_trending

HexStrike AI MCP Agents is an advanced MCP server that lets AI agents (Claude, GPT, Copilot, etc.) autonomously run 150+ cybersecurity tools for automated pentesting, vulnerability discovery, bug bounty automation, and security research. Seamlessly bridge LLMs with real-world offensive security capa

🟢 Gemma 4 12B first coding agent test on a 4080 Super — score 27 Sources: reddit/r/LocalLLaMA

Just threw the new Gemma 4 12B into VSCodium with the Pi Agent extension to see how it handles tools, and it nailed the test on the first try. I gave it a prompt to write a Python script that reads logs line-by-line, grabs the error modules, and dumps the counts to a JSON file. I also told it to mak
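
Here is a minimal sketch of the kind of script that prompt describes. The post does not show the generated code or the log format, so the format here (an ERROR level followed by a bracketed module name) and all function names are assumptions.

```python
import json
import re
from collections import Counter

# Assumed line shape: "<timestamp> ERROR [module.name] message"
ERROR_LINE = re.compile(r"\bERROR\b\s+\[(?P<module>[^\]]+)\]")


def count_error_modules(lines):
    """Count ERROR entries per module from an iterable of log lines."""
    counts = Counter()
    for line in lines:  # one line at a time, so large files are never fully loaded
        match = ERROR_LINE.search(line)
        if match:
            counts[match.group("module")] += 1
    return dict(counts)


def write_error_counts(log_path, json_path):
    """Read a log file line by line and dump per-module error counts to JSON."""
    with open(log_path, encoding="utf-8", errors="replace") as log_file:
        counts = count_error_modules(log_file)
    with open(json_path, "w", encoding="utf-8") as out_file:
        json.dump(counts, out_file, indent=2, sort_keys=True)
    return counts


# Hypothetical sample lines in the assumed format.
sample = [
    "2026-06-04 12:00:01 ERROR [db] connection timeout",
    "2026-06-04 12:00:02 INFO [db] reconnected",
    "2026-06-04 12:00:03 ERROR [auth] bad token",
    "2026-06-04 12:00:04 ERROR [db] retry failed",
]
sample_counts = count_error_modules(sample)
```

Passing the open file object straight to the counter keeps the read line-by-line; a different log layout would only need a different regular expression.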

🟢 Encodec.cpp, a portable C++ implementation of Meta's EnCodec using Eigen [P] — score 18 Sources: reddit/r/MachineLearning

I built a C++ implementation of Meta’s EnCodec using Eigen. Github: https://github.com/pfeatherstone/encodec.cpp Motivation: - A lightweight implementation of EnCodec with no runtime dependencies, in C++ - No ML runtime

🟢 graykode/abtop — Like htop, but for AI coding agents. Monitor Claude Code & Codex CLI sessions, tokens, context window, rate limits, and ports in real-time. — score 15 Sources: github_trending


Omitted 3 additional developer tools items from the main section; see raw data and source-specific sections below.

Infrastructure & Compute

🟢 Your AI has been building a picture of you for months. You have never seen it. — score 14 Sources: reddit/r/AIAgents

Every conversation shapes what it thinks you do, what you care about, how you like to work. It guessed your role from a passing comment. It assumed your preferences from one data point. It held onto that assumption for months. And you have no way to check unless you go digging. Most AI tools have no

Business & Funding

🟢 How can the numbers be this massive within a month ?? — score 35 Sources: reddit/r/LocalLLaMA

Why does it feel like these downloads are just inflated by the brain dead enterprises whose employees even after exhausting their $1,500 monthly credits are not able to cache it in a shared storage by prompting their AI waifu "Do not download it ever again every time my container gets TURNEDDD ONN!!!

🟢 I built a vulnerable app and spent $1,500 seeing if LLMs could hack it — score 31 Sources: hackernews

Other Signals

🟢 The first Gemma 4 12B finetunes are ready — score 12 Sources: reddit/r/LocalLLaMA

Now you can start building your Gemma 4 12B collection :) https://huggingface.co/igorls/gemma-4-12B-it-heretic-GGUF https://huggingface.co/ReadyArt/Melody1437-12B-v0.4-GGUF [https

🟢 I think I accidentally built a proto-cognitive system (not just another chatbot) that persists, adapts, and self-regulates over time :O — score 6 Sources: reddit/r/AIAgents

I’ve been working on a local AI project called DRIFT, and I just finished running a full benchmark + state analysis on it. This isn’t just prompt engineering or wrapper logic around an LLM. The system has: * Episodic memory (vector + structured) * Homeostasis (needs, regulation, crisis events) * Con

📈 Trending Repos

Repo	Description	Stars Today	Language
interviewstreet/hiring-agent	AI agent to evaluate and score resumes.	119	python
0x4m4/hexstrike-ai	HexStrike AI MCP Agents is an advanced MCP server that lets AI agents (Claude, GPT, Copilot, etc.) autonomously run 150+ cybersecurity tools for automated pentesting, vulnerability discovery, bug bounty automation, and security research. Seamlessly bridge LLMs with real-world offensive security capabilities.	38	python
graykode/abtop	Like htop, but for AI coding agents. Monitor Claude Code & Codex CLI sessions, tokens, context window, rate limits, and ports in real-time.	21	rust
NVIDIA-NeMo/Gym	Evaluate and improve models and agents using environments	1	python

📄 New Papers

Title	Category	Hotness	Link
OVO-S-Bench: A Hierarchical Benchmark for Streaming Spatial Intelligence in Multimodal LLMs	research_paper	24	Open
Reproducing, Analyzing, and Detecting Reward Hacking in Rubric-Based Reinforcement Learning	research_paper	18	Open
Unlocking Feature Learning in Gated Delta Networks at Scale	research_paper	3	Open
STRIDE: Training Data Attribution via Sparse Recovery from Subset Perturbations	research_paper	3	Open
Toward Pre-Deployment Assurance for Enterprise AI Agents: Ontology-Grounded Simulation and Trust Certification	cs.AI	0	Open
Stumbling Into AI Emotional Dependence: How Routine AI Interactions Reshape Human Connection	cs.AI	0	Open
Thinking Through Signs: PEEL as a Semiotic Scaffolding for Epistemically Accountable AI-Enabled Research	cs.AI	0	Open
SMAC-Talk: A Natural Language Extension of the StarCraft Multi-Agent Challenge for Large Language Models	cs.AI	0	Open
Consensus is Strategically Insufficient: Reasoning-Trace Disagreement as a Knowledge-Representation Signal	cs.AI	0	Open
VAMPS: Visual-Assisted Mathematical Problem Solving Benchmark	cs.AI	0	Open
StepPRM-RTL: Stepwise Process-Reward Guided LLM Fine-Tuning for Enhanced RTL Synthesis	cs.AI	0	Open
Can Generalist Agents Automate Data Curation?	cs.AI	0	Open
Characterizing initial human-AI proof formalization workflows	cs.AI	0	Open
The Saturation Trap and the Subjectivity of Intervention Timing: Why Affect-Based Triggers and LLM Judges Fail to Time Interventions on Autonomous Agents	cs.AI	0	Open
Exploring Cross-Scenario Generality of Agentic Memory Systems: Diagnostics and a Strong Baseline	cs.AI	0	Open

🏢 Lab Blog Posts

OpenAI: How Endava is redesigning software delivery around AI agents
OpenAI: Introducing new capabilities to GPT-Rosalind
OpenAI: How Wasmer used Codex to build a Node.js runtime for the edge

🐦 Twitter/X Highlights

Account	Tweet Summary
OpenAI	We’re bringing new capabilities to GPT-Rosalind, a model series purpose-built for life sciences research at enterprise scale. It brings GPT-5.5’s agentic coding and tool use together with stronger intelligence for drug discovery, analysis, design, and experimental workflows. https://openai.com/index
AnthropicAI	How well do the security community's techniques hold up against AI-enabled cyberattacks? We examined 832 malicious accounts and mapped their activity onto a longstanding database of tactics and techniques used by threat actors. Here's what we learned: https://www.anthropic.com/news/AI-enabled-cyber-t
xai	Try Grok models on @Cloudflare's AI Gateway!
xai	Meet Go by Gopuff and SpaceXAI: your personal shopping assistant that knows what you want and delivers in minutes. Powered by Grok text, audio, and image models.
simonw	Uber reportedly now caps coding agents at $1,500/month per employee per tool - seems sensible to me, but it's also an interesting hint at the value Uber thinks these tools are providing https://simonwillison.net/2026/Jun/3/uber-caps-usage/

Repeated From Recent Briefings

chopratejas/headroom — Compress tool outputs, logs, files, and RAG chunks before they reach the LLM. 60-95% fewer tokens, same answers. Library, proxy, MCP server. - first seen 2026-06-03
NousResearch/hermes-agent — The agent that grows with you - first seen 2026-05-11
Most of the software you rely on was hacked together fast - first seen 2026-06-03
farion1231/cc-switch — A cross-platform desktop All-in-One assistant for Claude Code, Codex, OpenCode, OpenClaw, Gemini CLI & Hermes Agent. Only official website: ccswitch.io - first seen 2026-05-08
nesquena/hermes-webui — Hermes WebUI: The best way to use Hermes Agent from the web or from your phone! - first seen 2026-06-01
Benchmarks are Not Enough: RAMP for Runtime Assessing of Agentic Models in Production Systems - first seen 2026-05-28
Open-LLM-VTuber/Open-LLM-VTuber — Talk to any LLM with hands-free voice interaction, voice interruption, and Live2D talking face running locally across platforms - first seen 2026-05-08
supermemoryai/supermemory — Memory engine and app that is extremely fast, scalable. The Memory API for the AI era. - first seen 2026-06-01
MiniMax dropped a new attention architecture. [N] - first seen 2026-06-03
anomalyco/opencode — The open source coding agent. - first seen 2026-05-09
... plus 140 more repeated items in processed data

AI Watchtower Briefing — 2026-06-04
