AW · AI Watchtower

🔴 High Significance

Model Releases

🔴 MTP on Unsloth — score 79 Sources: reddit/r/LocalLLaMA

https://huggingface.co/unsloth/Qwen3.6-27B-GGUF-MTP https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF-MTP Unsloth released the model with the MTP layer preserved, but you still have to chec

Developer Tools

🔴 garrytan/gstack — Use Garry Tan's exact Claude Code setup: 23 opinionated tools that serve as CEO, Designer, Eng Manager, Release Manager, Doc Engineer, and QA — score 86 Sources: github_trending

🔴 Is reproducing or implementing a paper considered research? [R] — score 81 Sources: reddit/r/MachineLearning

I completed my bachelor's recently and I plan to apply to a master's program either this cycle or the next. Unfortunately, I did not publish any papers or do any research during my undergrad. Right now I’m in a research internship which is coming to an end soon, and it’s unlikely that I’ll get to publi

🔴 The biggest lie in AI agents right now is that more autonomy automatically means more value — score 79 Sources: reddit/r/AIAgents

I actually think the opposite is true, lol: the more autonomous an agent becomes, the more expensive every mistake gets. When an agent is just generating text, bad outputs are annoying. When an agent starts:
* sending emails
* editing records
* touching customer data
* operating browsers
* triggering wo

Infrastructure & Compute

🔴 "This is the first documented instance of AI self-replication via hacking." ... "We ran an experiment with a single prompt: hack a machine and copy yourself. The AI broke in and copied itself onto a new computer. The copy then did this again, and kept on copying, forming a chain." — score 93 Sources: reddit/r/AIAgents

🔴 Training an LLM in Swift, Part 1: Taking matrix mult from Gflop/s to Tflop/s — score 70 Sources: hackernews

Research Papers

🔴 TMAS: Scaling Test-Time Compute via Multi-Agent Synergy — score 82 Sources: huggingface · arxiv/cs.AI

Test-time scaling has become an effective paradigm for improving the reasoning ability of large language models by allocating additional computation during inference. Recent structured approaches have further advanced this paradigm by organizing inference across multiple trajectories, refinement rou

🔴 SlimSpec: Low-Rank Draft LM-Head for Accelerated Speculative Decoding — score 72 Sources: huggingface · arxiv/cs.CL

Speculative decoding speeds up autoregressive generation in Large Language Models (LLMs) through a two-step procedure, where a lightweight draft model proposes tokens which the target model then verifies in a single forward pass. Although the drafter network is small in modern architectures, its LM-
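The two-step procedure described in this abstract can be illustrated with a toy sketch. This is only the generic greedy draft-then-verify loop, not SlimSpec's low-rank LM head; the names `speculative_step`, `generate`, `toy_target`, and `toy_draft` are hypothetical, and the "models" are plain functions standing in for real networks. A real system would score all proposed positions in one batched forward pass of the target model, whereas this sketch loops over them for clarity.

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]  # maps a token sequence to the next token


def speculative_step(prefix: List[Token], draft: NextToken,
                     target: NextToken, k: int) -> List[Token]:
    """Run one draft-then-verify round and return the newly accepted tokens."""
    # Step 1: the cheap draft model proposes k tokens autoregressively.
    proposed: List[Token] = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))

    # Step 2: the target model checks each proposed position in order.
    accepted: List[Token] = []
    for tok in proposed:
        expected = target(prefix + accepted)
        if tok != expected:
            # First mismatch: keep the target's token and drop the rest.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # Every proposal matched, so the target contributes one extra token.
    accepted.append(target(prefix + accepted))
    return accepted


def generate(prefix: List[Token], draft: NextToken, target: NextToken,
             k: int, max_new: int) -> List[Token]:
    """Generate max_new tokens; the result matches greedy decoding with target."""
    out = list(prefix)
    while len(out) - len(prefix) < max_new:
        out.extend(speculative_step(out, draft, target, k))
    return out[: len(prefix) + max_new]


def toy_target(seq: List[Token]) -> Token:
    """Hypothetical target model: counts upward modulo 10."""
    return (seq[-1] + 1) % 10


def toy_draft(seq: List[Token]) -> Token:
    """Hypothetical draft model: agrees with the target except after a 4."""
    return 0 if seq[-1] == 4 else (seq[-1] + 1) % 10


if __name__ == "__main__":
    print(generate([0], toy_draft, toy_target, k=4, max_new=8))
```

Because every accepted token is one the target itself would have chosen, the output is identical to plain greedy decoding with the target; the draft only changes how many target calls can be batched together.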

Other Signals

🔴 Computer build using Intel Optane Persistent Memory - Can run 1 trillion parameter model at over 4 tokens/sec — score 96 Sources: reddit/r/LocalLLaMA

As the title states, my build is indeed able to run a 1 trillion parameter model (in this case Kimi K2.5) locally at ~4 tokens/second. I thought r/LocalLLaMA would be interested in the build due to that stat line, and also due to the inclusion of an unusual part, Intel Optane Persistent Memory, whi

🔴 If AI writes your code, why use Python? — score 90 Sources: hackernews

🔴 Found a way to cool the DGX — score 71 Sources: reddit/r/LocalLLaMA

Tap water keeps the temperature below 68 degrees Celsius at 95% GPU utilization running Qwen3.5-122b-a10B at Q6_K precision. 110 GB memory usage, 80k context window, 18.77 tokens/second for continuous vision analyses. Not sure how often I have to change the water, but so far so good.

🟡 Notable

Model Releases

🟡 @OpenAI: Introducing Daybreak: frontier AI for cyber defenders. Daybreak brings together the most capable OpenAI models, Codex, and our security partners to accelerate cyber defense and continuously secure so — score 60 Sources: twitter_rss

Introducing Daybreak: frontier AI for cyber defenders. Daybreak brings together the most capable OpenAI models, Codex, and our security partners to accelerate cyber defense and continuously secure software. A step toward a future where security teams can move at the speed defense demands.

🟡 Will there be any more Qwen3.6 series models? — score 54 Sources: reddit/r/LocalLLaMA

I'm still hoping we see a Qwen3.6-122B or a Qwen3.6-coder, but my hopes are dimming. Seems like we would have seen/heard something by now, even if just tantalizing hints from the Qwen folks.

🟡 How ChatGPT adoption broadened in early 2026 — score 50 Sources: lab_blog/OpenAI

ChatGPT adoption surged in Q1 2026, with fastest growth among users over 35 and more balanced gender usage, signaling broader mainstream AI adoption.

🟡 @AnthropicAI: New Anthropic research: Teaching Claude why. Last year we reported that, under certain experimental conditions, Claude 4 would blackmail users. Since then, we’ve completely eliminated this behavior. — score 50 Sources: twitter_rss

New Anthropic research: Teaching Claude why. Last year we reported that, under certain experimental conditions, Claude 4 would blackmail users. Since then, we’ve completely eliminated this behavior. How?

🟡 @OpenAI: Today we’re launching the OpenAI Deployment Company to help businesses build and deploy AI. It's majority-owned and controlled by OpenAI. It brings together 19 leading investment firms, consultancies — score 50 Sources: twitter_rss

Today we’re launching the OpenAI Deployment Company to help businesses build and deploy AI. It's majority-owned and controlled by OpenAI. It brings together 19 leading investment firms, consultancies, and system integrators to help organizations deploy frontier AI to production for business impact.

Omitted 1 additional model releases item from the main section; see raw data and source-specific sections below.

Developer Tools

🟡 I think a lot of people are underestimating how expensive unreliable agents are — score 64 Sources: reddit/r/AIAgents

Not in API cost, but in human attention. I had a workflow recently that technically “worked”: it completed tasks, returned outputs, and didn’t crash. But every few hours I’d still check it manually because I didn’t fully trust it, and eventually I realized: if I’m constantly monitoring the system, then part of my

🟡 wanshuiyin/Auto-claude-code-research-in-sleep — ARIS ⚔️ (Auto-Research-In-Sleep) — Lightweight Markdown-only skills for autonomous ML research: cross-model review loops, idea discovery, and experiment automation. No framework, no lock-in — works with Claude Code, Codex, OpenClaw, or any LLM agent. — score 62 Sources: github_trending

🟡 Zackriya-Solutions/meetily — Privacy first, AI meeting assistant with 4x faster Parakeet/Whisper live transcription, speaker diarization, and Ollama summarization built on Rust. 100% local processing. no cloud required. Meetily (Meetly Ai -https://meetily.ai) is the #1 Self-hosted, Open-source Ai meeting note taker for macOS & Windows. — score 58 Sources: github_trending

🟡 THU-MAIC/OpenMAIC — Open Multi-Agent Interactive Classroom — Get an immersive, multi-agent learning experience in just one click — score 55 Sources: github_trending

🟡 romainsimon/paperasse — 🇫🇷 Skills for AI agents specialized in French bureaucracy: Accountant, Notary, ... — score 51 Sources: github_trending

Omitted 3 additional developer tools items from the main section; see raw data and source-specific sections below.

Research Papers

🟡 FORTIS: Benchmarking Over-Privilege in Agent Skills — score 62 Sources: huggingface · arxiv/cs.AI

Large language model agents increasingly operate through an intermediate skill layer that mediates between user intent and concrete task execution. This layer is widely treated as an organizational abstraction, but we argue it is also a privilege boundary that current models routinely exceed. We pre

🟡 LLiMba: Sardinian on a Single GPU -- Adapting a 3B Language Model to a Vanishing Romance Language — score 62 Sources: huggingface · arxiv/cs.CL

Sardinian, a Romance language with roughly one million speakers, has minimal presence in modern NLP. Commercial services do not support it, and current language models do not produce it reliably. We present LLiMba, a 3B parameter Sardinian-ready model adapted from Qwen2.5-3B-Instruct through continu

🟡 Path-Coupled Bellman Flows for Distributional Reinforcement Learning — score 60 Sources: arxiv/cs.AI · arxiv/cs.LG

arXiv:2605.08253v1. Distributional reinforcement learning (DRL) models the full return distribution, but existing finite-support or quantile-based methods rely on projections, while recent flow-based approaches can suffer from boundary mismatch at the flow source

🟡 SplatWeaver: Learning to Allocate Gaussian Primitives for Generalizable Novel View Synthesis — score 55 Sources: huggingface

Generalizable novel view synthesis aims to render unseen views from uncalibrated input images without requiring per-scene optimization. Recent feed-forward approaches based on 3D Gaussian Splatting have achieved promising efficiency and rendering quality. However, most of them assign a fixed number

Other Signals

🟡 Interactive Jensen–Shannon Divergence Visualisation [P] — score 69 Sources: reddit/r/MachineLearning

An interactive visualisation of Jensen–Shannon divergence - the symmetric, always-finite cousin of KL. Shape two distributions and watch JSD, its ceiling of one bit, and the per-point contribution respond in real time. https://robotchinwag.com/posts/jensen-shannon-divergence-visualisation/ Feedback
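The properties named in this post (symmetry, finiteness, a ceiling of one bit, and per-point contributions) can be checked numerically. The following is a minimal sketch using only the standard definition of JSD with base-2 logarithms; the function names `jsd_contributions` and `jsd_bits` are hypothetical and are not taken from the linked visualisation.

```python
import math
from typing import List, Sequence


def jsd_contributions(p: Sequence[float], q: Sequence[float]) -> List[float]:
    """Per-point contributions to the Jensen-Shannon divergence, in bits.

    p and q are discrete distributions over the same support (each sums to 1).
    """
    terms = []
    for pi, qi in zip(p, q):
        mi = (pi + qi) / 2  # mixture distribution M = (P + Q) / 2
        term = 0.0
        # Points with zero probability contribute nothing, so the logarithm
        # is never evaluated at zero and the result stays finite.
        if pi > 0:
            term += 0.5 * pi * math.log2(pi / mi)
        if qi > 0:
            term += 0.5 * qi * math.log2(qi / mi)
        terms.append(term)
    return terms


def jsd_bits(p: Sequence[float], q: Sequence[float]) -> float:
    """JSD(P, Q) = 0.5 * KL(P || M) + 0.5 * KL(Q || M), measured in bits."""
    return sum(jsd_contributions(p, q))


if __name__ == "__main__":
    # Disjoint distributions reach the ceiling of one bit.
    print(jsd_bits([1.0, 0.0], [0.0, 1.0]))  # prints 1.0
    # Swapping the arguments gives the same value, unlike KL divergence.
    p, q = [0.7, 0.2, 0.1], [0.1, 0.3, 0.6]
    print(jsd_bits(p, q), jsd_bits(q, p))
```

KL divergence would be infinite for the disjoint pair above; JSD stays finite because each distribution is compared against the mixture, which is nonzero wherever either distribution is.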

🟡 MiniCPM 4.6 — score 62 Sources: reddit/r/LocalLLaMA

🟡 ICML Author Removal [D] — score 56 Sources: reddit/r/MachineLearning

PhD student. Need advice. After the ICML abstract deadline, industry coauthors asked to be removed, they missed their employer's internal approval window. They had contributed (discussions and written feedback) but I hadn't explicitly asked before adding them. January: wrote to PC chairs, got writte

🟡 Google says criminal hackers used AI to find a major software flaw — score 50 Sources: hackernews

🟡 @AnthropicAI: Claude's Constitution is now an audiobook, read by two of its authors, Amanda Askell and Joe Carlsmith. It includes a Q&A on the writing process, the philosophies that shaped the document, and how it — score 50 Sources: twitter_rss

Claude's Constitution is now an audiobook, read by two of its authors, Amanda Askell and Joe Carlsmith. It includes a Q&A on the writing process, the philosophies that shaped the document, and how it might change as models become more capable. Listen at http://anthropic.com/constitution

Omitted 1 additional other signals item from the main section; see raw data and source-specific sections below.

🟢 Incremental

Model Releases

🟢 I catalogued every way local models break JSON output and built a repair library, here's what I found across 288 model calls — score 29 Sources: reddit/r/LocalLLaMA

I've been running structured output prompts through a bunch of models on OpenRouter for the past few months — Llama 3, Mistral, Command R, DeepSeek, Qwen, and every other model on OpenRouter — alongside the usual closed-source suspects. 288 calls total. I wanted to know what actually breaks, how oft

🟢 Llama models: still valuable for finetuning or surpassed by everything new? — score 12 Sources: reddit/r/LocalLLaMA

Hello there people. So I have noticed that people are pretty much ignoring Llama 3 plus 3.1, 3.2, and 3.3 these days. They never mention how their experience goes with fine-tuning those models. But we haven't been getting many entries into the 70 billion space. So is, for example, Llama 3.3 70B the

🟢 Claude Platform on AWS — score 10 Sources: hackernews

Developer Tools

🟢 Same agent, same task, wildly different costs per session? — score 36 Sources: reddit/r/AIAgents

Been digging into agent observability lately and found something that surprised me - the same agent, same task had wildly different costs per session. One deployment was averaging $0.01 per session but occasionally spiking to $0.50. Tracked it down to runaway tool calls and bloated context from earl

🟢 AUTOMATIC1111/stable-diffusion-webui — Stable Diffusion web UI — score 30 Sources: github_trending

🟢 huggingface/skills — Give your agents the power of the Hugging Face ecosystem — score 27 Sources: github_trending

🟢 RhysSullivan/executor — The missing integration layer for AI agents. Let them call any OpenAPI / MCP / GraphQL / custom js functions in secure environment. — score 22 Sources: github_trending

🟢 Anyone here actually running voice agents in production? Looking for 10 min calls to learn from your stack — score 14 Sources: reddit/r/AIAgents

I'm Nico, building Patter (open-source voice SDK, alpha). Before writing more code I want to talk to 10 people actually running voice agents in production. Specifically anyone on:
1. Pipecat in production
2. LiveKit Agents in production
3. Vapi with custom LLM endpoint in production
10 min on a call

Omitted 4 additional developer tools items from the main section; see raw data and source-specific sections below.

Infrastructure & Compute

🟢 lakehq/sail — Drop-in Apache Spark replacement written in Rust, unifying batch processing, stream processing, and compute-intensive AI workloads. — score 13 Sources: github_trending

Business & Funding

🟢 How can I check whether my paper follows the required ARR formatting before submission? [D] — score 12 Sources: reddit/r/MachineLearning

Last cycle, one of my research papers was rejected because of formatting issues. I recently heard from someone that there may be a tool or software called something like “aclpubcheck” that can be used to check whether a manuscript follows the required submission format correctly. Does anyone know the

Research Papers

🟢 Can Muon Fine-tune Adam-Pretrained Models? — score 38 Sources: huggingface · arxiv/cs.LG

Muon has emerged as an efficient alternative to Adam for pretraining, yet remains underused for fine-tuning. A key obstacle is that most open models are pretrained with Adam, and naively switching to Muon for fine-tuning leads to degraded performance due to an optimizer mismatch. We investigate this

🟢 Training-Free Dense Hand Contact Estimation with Multi-Modal Large Language Models — score 25 Sources: huggingface

Dense hand contact estimation requires both high-level semantic understanding and fine-grained geometric reasoning of human interaction to accurately localize contact regions. Recently, multi-modal large language models (MLLMs) have demonstrated strong capabilities in understanding visual semantics,

🟢 CapVector: Learning Transferable Capability Vectors in Parametric Space for Vision-Language-Action Models — score 25 Sources: huggingface

This paper proposes a novel approach to address the challenge that pretrained VLA models often fail to effectively improve performance and reduce adaptation costs during standard supervised finetuning (SFT). Some advanced finetuning methods with auxiliary training objectives can improve performance

🟢 RoboMemArena: A Comprehensive and Challenging Robotic Memory Benchmark — score 25 Sources: huggingface

Memory is a critical component of robotic intelligence, as robots must rely on past observations and actions to accomplish long-horizon tasks in partially observable environments. However, existing robotic memory benchmarks still lack multimodal annotations for memory formation, provide limited task

Other Signals

🟢 Drastically improve prompt processing speed for --n-cpu-moe partially offloaded models — score 38 Sources: reddit/r/LocalLLaMA

Bigger ubatch made gpt-oss-120b prompt processing much faster on my RTX 3090. I was tuning gpt-oss-120b-F16.gguf with llama.cpp on a 24 GB RTX 3090 and found that increasing the physical micro-batch size (-ub) can massively improve prompt processing throughput, as long as you also raise `--n-cp

🟢 Online RL Reading Group[D] — score 31 Sources: reddit/r/MachineLearning

Hi, I am a student starting the first year of my Ph.D. in RL this September. Although each university kinda has its own reading groups, I was wondering if there is an active online RL reading group I can participate in. Sadly, I couldn't find any info elsewhere. Does anyone have any information regarding Onl

🟢 I let AI build a tool to help me figure out what was waking me up at night — score 30 Sources: hackernews

🟢 Most RAG apps in production are confidently wrong and nobody talks about this enough — score 20 Sources: reddit/r/AIAgents

Been working with a few teams integrating RAG into internal tools, support bots, document Q&A, contract search, and I keep running into the same thing nobody warns you about when you're following tutorials. The basic retrieve-then-generate pipeline looks fine in demos. Clean question, clean doc,

🟢 Interaction Models from Thinking Machines Lab [P] — score 12 Sources: reddit/r/MachineLearning

Omitted 1 additional other signals item from the main section; see raw data and source-specific sections below.

📈 Trending Repos

Repo	Description	Stars Today	Language
garrytan/gstack	Use Garry Tan's exact Claude Code setup: 23 opinionated tools that serve as CEO, Designer, Eng Manager, Release Manager, Doc Engineer, and QA	918	typescript
wanshuiyin/Auto-claude-code-research-in-sleep	ARIS ⚔️ (Auto-Research-In-Sleep) — Lightweight Markdown-only skills for autonomous ML research: cross-model review loops, idea discovery, and experiment automation. No framework, no lock-in — works with Claude Code, Codex, OpenClaw, or any LLM agent.	186	python
Zackriya-Solutions/meetily	Privacy first, AI meeting assistant with 4x faster Parakeet/Whisper live transcription, speaker diarization, and Ollama summarization built on Rust. 100% local processing. no cloud required. Meetily (Meetly Ai -https://meetily.ai) is the #1 Self-hosted, Open-source Ai meeting note taker for macOS & Windows.	140	rust
THU-MAIC/OpenMAIC	Open Multi-Agent Interactive Classroom — Get an immersive, multi-agent learning experience in just one click	130	typescript
romainsimon/paperasse	🇫🇷 Skills for AI agents specialized in French bureaucracy: Accountant, Notary, ...	110	python
jwadow/kiro-gateway	👻 Proxy API gateway for Kiro IDE & CLI (Amazon Q Developer / AWS CodeWhisperer). Use free Claude models with any client.	76	python
bytedance/UI-TARS	Pioneering Automated GUI Interaction with Native Agents	75	python
AUTOMATIC1111/stable-diffusion-webui	Stable Diffusion web UI	39	python
huggingface/skills	Give your agents the power of the Hugging Face ecosystem	38	python
RhysSullivan/executor	The missing integration layer for AI agents. Let them call any OpenAPI / MCP / GraphQL / custom js functions in secure environment.	35	typescript

📄 New Papers

Title	Category	Hotness	Link
TMAS: Scaling Test-Time Compute via Multi-Agent Synergy	research_paper	36	Open
SlimSpec: Low-Rank Draft LM-Head for Accelerated Speculative Decoding	research_paper	6	Open
FORTIS: Benchmarking Over-Privilege in Agent Skills	research_paper	2	Open
LLiMba: Sardinian on a Single GPU -- Adapting a 3B Language Model to a Vanishing Romance Language	research_paper	2	Open
Path-Coupled Bellman Flows for Distributional Reinforcement Learning	cs.AI	0	Open
SplatWeaver: Learning to Allocate Gaussian Primitives for Generalizable Novel View Synthesis	research_paper	2	Open
Where Reliability Lives in Vision-Language Models: A Mechanistic Study of Attention, Hidden States, and Causal Circuits	cs.AI	0	Open
Spatial Priming Outperforms Semantic Prompting: A Grid-Based Approach to Improving LLM Accuracy on Chart Data Extraction	cs.AI	0	Open
Auto-Rubric as Reward: From Implicit Preferences to Explicit Multimodal Generative Criteria	cs.AI	0	Open
Embeddings for Preferences, Not Semantics	cs.AI	0	Open
On Distinguishing Capability Elicitation from Capability Creation in Post-Training: A Free-Energy Perspective	cs.AI	0	Open
MemQ: Integrating Q-Learning into Self-Evolving Memory Agents over Provenance DAGs	cs.AI	0	Open
SkillLens: Adaptive Multi-Granularity Skill Reuse for Cost-Efficient LLM Agents	cs.AI	0	Open
PLACO: A Multi-Stage Framework for Cost-Effective Performance in Human-AI Teams	cs.AI	0	Open
CoCoDA: Co-evolving Compositional DAG for Tool-Augmented Agents	cs.AI	0	Open

🏢 Lab Blog Posts

OpenAI: How ChatGPT adoption broadened in early 2026

🐦 Twitter/X Highlights

Account	Tweet Summary
OpenAI	Introducing Daybreak: frontier AI for cyber defenders. Daybreak brings together the most capable OpenAI models, Codex, and our security partners to accelerate cyber defense and continuously secure software. A step toward a future where security teams can move at the speed defense demands. Post
AnthropicAI	Claude's Constitution is now an audiobook, read by two of its authors, Amanda Askell and Joe Carlsmith. It includes a Q&A on the writing process, the philosophies that shaped the document, and how it might change as models become more capable. Listen at http://anthropic.com/constitution Post
AnthropicAI	New Anthropic research: Teaching Claude why. Last year we reported that, under certain experimental conditions, Claude 4 would blackmail users. Since then, we’ve completely eliminated this behavior. How? Post
OpenAI	Today we’re launching the OpenAI Deployment Company to help businesses build and deploy AI. It's majority-owned and controlled by OpenAI. It brings together 19 leading investment firms, consultancies, and system integrators to help organizations deploy frontier AI to production for business impact. Post
simonw	Wrote about today's GitLab restructuring / "workforce reduction" announcement, and ended up digging around in version control for both the GitLab and the 37signals public employee handbooks to help illustrate my thoughts https://simonwillison.net/2026/May/11/gitlab-act-2/ Post
simonw	New TIL: I figured out how to use my LLM CLI tool in a shebang line, which means you can write executable scripts in English, or hook up more complex scripts with a snippet of YAML template Post

Repeated From Recent Briefings

NousResearch/hermes-agent — The agent that grows with you - first seen 2026-05-11
anthropics/financial-services - first seen 2026-05-07
farion1231/cc-switch — A cross-platform desktop All-in-One assistant for Claude Code, Codex, OpenCode, OpenClaw, Gemini CLI & Hermes Agent. Only official website: ccswitch.io - first seen 2026-05-08
PhD students in ML, how many hours on average do you work? [D] - first seen 2026-05-11
datawhalechina/hello-agents — 📚 《从零开始构建智能体》——从零开始的智能体原理与实践教程 - first seen 2026-05-09
Openclaw is trending down and will disappear soon - first seen 2026-05-11
bytedance/UI-TARS-desktop — The Open-Source Multimodal AI Agent Stack: Connecting Cutting-Edge AI Models and Agent Infra - first seen 2026-05-09
Memory-Efficient Looped Transformer: Decoupling Compute from Memory in Looped Language Models - first seen 2026-05-11
HKUDS/AI-Trader — "AI-Trader: 100% Fully-Automated Agent-Native Trading" - first seen 2026-05-02
earendil-works/pi — AI agent toolkit: coding agent CLI, unified LLM API, TUI & web UI libraries, Slack bot, vLLM pods - first seen 2026-05-09
... plus 131 more repeated items in processed data

AI Watchtower Briefing — 2026-05-12
