
12 posts tagged with "ai-infrastructure"

Behind the Stack, Ep 2: How Many Users Can My GPU Serve?

4 min read
Jamie Dborin
Founder & Member of Technical Staff, Doubleword

Introduction

When self-hosting LLMs and productionising AI, one of the first practical questions you’ll run into is: “How many users can this system actually support?”

It’s a question rooted in system design, not just intuition. While it's easy to watch GPU utilization or rely on batch size as a proxy, neither gives you a reliable measure of how far your hardware can actually stretch under real-world loads.

In this video, we break down the calculation that gives you a usable estimate of your system's capacity, grounded in memory constraints and model architecture. With just a few known quantities (model config, token usage, GPU size), you can forecast how many users your setup can realistically support - and how to grow that number.
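
As a taste of the arithmetic, here is a minimal back-of-envelope sketch in Python. Every figure in it (GPU size, model shape, per-user token budget) is an illustrative assumption, not a number from the episode: the idea is that concurrency is bounded by the KV-cache memory left over once the weights are loaded.

```python
# Back-of-envelope capacity estimate. All figures are illustrative
# assumptions (an 80 GiB GPU and a Llama-3-8B-like config).

GIB = 1024**3

gpu_memory_bytes = 80 * GIB   # total device memory
weight_bytes     = 8e9 * 2    # 8B params in fp16 (2 bytes each)
num_layers       = 32
num_kv_heads     = 8          # grouped-query attention
head_dim         = 128
kv_dtype_bytes   = 2          # fp16 KV cache

# KV cache per token: keys + values, across every layer
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * kv_dtype_bytes

# Assumed average footprint per user: prompt + generated tokens in flight
tokens_per_user = 4096

# Whatever memory the weights don't take is the KV-cache budget
kv_budget = gpu_memory_bytes - weight_bytes
max_concurrent_users = kv_budget // (tokens_per_user * kv_bytes_per_token)
print(f"~{int(max_concurrent_users)} concurrent requests fit in KV cache")
```

Under these assumptions the KV cache costs 128 KiB per token, so each 4096-token user holds 512 MiB, and roughly 130 requests fit at once. Shrinking the cache dtype, the context budget, or the model footprint moves that number directly.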

Behind the Stack, Ep 1: What Should I Be Observing in my LLM Stack?

3 min read
Jamie Dborin
Founder & Member of Technical Staff, Doubleword

Introduction

It’s easy to default to GPU or CPU utilization to assess LLM system load - but that’s a trap. These metrics were built for traditional compute workloads and fall short in LLM deployments: they can stay flat while your model silently hits capacity, leading to missed scaling signals and degraded performance.
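
To make that concrete, here is a minimal sketch of polling LLM-specific signals instead of device utilization. It assumes an inference server that exposes Prometheus-style text metrics at a /metrics endpoint; the endpoint and metric names below are hypothetical placeholders, not any specific server's API.

```python
# Minimal sketch: watch queue depth and KV-cache occupancy rather than raw
# GPU utilization. Endpoint and metric names are hypothetical placeholders.
import re
import time
import urllib.request

METRICS_URL = "http://localhost:8000/metrics"  # assumed endpoint

def scrape(name: str) -> float:
    """Pull one gauge value out of a Prometheus text exposition."""
    body = urllib.request.urlopen(METRICS_URL).read().decode()
    match = re.search(rf"^{re.escape(name)}\s+([0-9.eE+-]+)$", body, re.M)
    return float(match.group(1)) if match else 0.0

while True:
    queued = scrape("llm_requests_waiting")     # requests stuck in the queue
    cache = scrape("llm_kv_cache_usage_ratio")  # fraction of KV cache in use
    # Queue growth and a near-full KV cache flag saturation long before
    # GPU utilization counters do.
    if queued > 0 or cache > 0.9:
        print(f"capacity pressure: {queued} queued, cache {cache:.0%} full")
    time.sleep(5)
```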