
Behind the Stack, Ep 2: How Many Users Can My GPU Serve?

Jamie Dborin
Founder & Member of Technical Staff, Doubleword

Introduction

When self-hosting LLMs and productionising AI, one of the first practical questions you’ll run into is: “How many users can this system actually support?”

It’s a question rooted in system design, not just intuition. While it's easy to watch GPU utilization or rely on batch size as a proxy, neither gives you a reliable measure of how far your hardware can actually stretch under real-world loads.

In this video, we break down the calculation that gives you a usable estimate of your system's capacity - grounded in memory constraints and model architecture. With just a few known quantities (model config, token usage, GPU size), you can forecast how many users your setup can realistically support, as well as how to grow that number.

What Should I Be Observing?

GPU Memory: What's Actually Using It?

At inference time, your GPU memory gets divided among three major components:

  • Model Weights - a fixed chunk, based on parameter count and precision
  • Activations - temporary tensors created during forward passes (often small and engine-managed)
  • KV Cache - memory that stores every token currently active in the system

For real-time or multi-user workloads, the KV cache is often the limiting factor. It's what determines whether a new user’s request can be served without delay, regardless of what your GPU utilization says.

The Core Calculation

Model Weights: parameter count × bytes per parameter. A 7B-parameter model in FP16 (2 bytes per parameter) takes roughly 14 GB of VRAM before a single request is served.

KV Cache per token: 2 (keys and values) × number of layers × number of KV heads × head dimension × bytes per element. For a Llama-2-7B-style config (32 layers, 32 KV heads, head dimension 128, FP16 cache) that works out to about 0.5 MB per token.

Token budget: (total VRAM - model weights - activation overhead) ÷ KV cache bytes per token. Divide that token budget by the average number of tokens each user holds in flight (prompt plus generated output) and you have your estimate of concurrent users.
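As a back-of-the-envelope sketch, the whole calculation fits in a few lines of Python. The config values (a Llama-2-7B-style model: 32 layers, 32 KV heads, head dimension 128), the 80 GB card, the 4 GB activation allowance, and the 4k tokens per user are illustrative assumptions, not measurements from any particular deployment; substitute your own model config and engine overheads.

```python
def estimate_concurrent_users(
    n_params: float,            # total parameters, e.g. 7e9
    bytes_per_param: float,     # 2 for FP16/BF16, 1 for INT8, 0.5 for INT4
    n_layers: int,
    n_kv_heads: int,            # fewer than attention heads if the model uses GQA/MQA
    head_dim: int,
    kv_bytes_per_elem: float,   # 2 for an FP16 KV cache, 1 for FP8/INT8
    vram_gb: float,
    activation_overhead_gb: float,  # rough allowance for activations and engine overhead
    tokens_per_user: int,           # average prompt + generated tokens held in flight
) -> dict:
    weights_gb = n_params * bytes_per_param / 1e9
    kv_bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * kv_bytes_per_elem
    token_budget = int((vram_gb - weights_gb - activation_overhead_gb) * 1e9 / kv_bytes_per_token)
    return {
        "weights_gb": round(weights_gb, 1),
        "kv_mb_per_token": round(kv_bytes_per_token / 1e6, 2),
        "token_budget": token_budget,
        "concurrent_users": token_budget // tokens_per_user,
    }

# Illustrative: a Llama-2-7B-style model in FP16 on an 80 GB GPU,
# with each user holding ~4k tokens (prompt + generated output) in flight.
print(estimate_concurrent_users(
    n_params=7e9, bytes_per_param=2,
    n_layers=32, n_kv_heads=32, head_dim=128, kv_bytes_per_elem=2,
    vram_gb=80, activation_overhead_gb=4, tokens_per_user=4096,
))
# -> ~14 GB of weights, ~0.52 MB of KV cache per token,
#    ~118k tokens of KV budget, ~28 concurrent users
```

Note that the KV cache term scales with the number of KV heads, which is why models using grouped-query attention are much cheaper per token than this example.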

Scaling That Number

Once you understand the math, there are three main ways to increase your capacity:

1. Quantize the Model (and/or the KV Cache)

Reducing model precision shrinks the memory footprint of weights - and can sometimes reduce KV cache size if supported by your inference engine.

KV cache quantization is less common in production but can double or quadruple token capacity if supported. The tradeoff is increased decoding latency unless fused dequantization kernels are available.
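To see what this does to the illustrative numbers from the core calculation, here is the same arithmetic with INT4 weights (~0.5 bytes per parameter) and an FP8 KV cache (1 byte per element) - assumptions for illustration, not benchmarks of any specific quantization scheme:

```python
# Same illustrative 7B model on an 80 GB GPU as above, now quantized.
weights_gb = 7e9 * 0.5 / 1e9                 # ~3.5 GB of INT4 weights (was ~14 GB in FP16)
kv_per_token = 2 * 32 * 32 * 128 * 1         # ~0.26 MB/token with an FP8 cache (was ~0.52 MB)
token_budget = int((80 - weights_gb - 4) * 1e9 / kv_per_token)
print(token_budget, token_budget // 4096)    # ~276k tokens, ~67 users at 4k tokens each
```

That is more than double the FP16 baseline's concurrency, before accounting for any quality or latency cost.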

2. Increase Available VRAM

You can scale up or out:

  • Vertical scaling: Upgrade to higher VRAM GPUs (e.g., 24GB → 80GB → 128GB)
  • Horizontal scaling: Distribute the model across multiple GPUs using tensor parallelism or pipeline parallelism

More VRAM gives you a larger KV cache - and therefore more tokens to work with. Horizontal scaling introduces some duplication overhead and infrastructure complexity, but it’s often necessary at larger scale.
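Plugging both options into the same illustrative numbers (7B model in FP16, 4 GB activation allowance per GPU, 4k tokens per user):

```python
kv_per_token = 2 * 32 * 32 * 128 * 2                        # ~0.52 MB/token (FP16 cache)

# Vertical: one 128 GB GPU instead of an 80 GB GPU.
tokens_vertical = int((128 - 14 - 4) * 1e9 / kv_per_token)  # ~210k tokens (~51 users at 4k each)

# Horizontal (rough): tensor parallelism over two 80 GB GPUs shards the weights
# and the KV heads, so to a first approximation the VRAM pools; real deployments
# pay some duplication and communication overhead on top of this.
tokens_horizontal = int((2 * 80 - 14 - 2 * 4) * 1e9 / kv_per_token)  # ~263k tokens (~64 users)

print(tokens_vertical, tokens_horizontal)
```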

3. Offload the KV Cache

Some engines allow you to offload older KV layers to CPU or even disk, or keep only the last few layers on GPU. This can reduce GPU KV cache usage by 90%+.

The catch is latency. Unless your inference engine overlaps data movement with computation efficiently, you'll see increased response times - so this is best used in workloads that prioritize token capacity over speed.
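As a rough, engine-agnostic illustration of the arithmetic (not a feature of any particular engine), keeping only the last few layers' KV cache resident shrinks the on-GPU footprint proportionally:

```python
n_layers, gpu_layers = 32, 2                        # illustrative: keep 2 of 32 layers on GPU
kv_per_token = 2 * n_layers * 32 * 128 * 2          # ~0.52 MB/token for the full model (FP16)
on_gpu_per_token = kv_per_token * gpu_layers / n_layers   # ~0.03 MB/token stays in VRAM
print(f"on-GPU KV cache reduced by {1 - gpu_layers / n_layers:.0%}")   # 94%
```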

Other Considerations

These calculations give you a strong first estimate, but real-world behavior varies depending on:

  • Inference engine - whether it supports paged attention, chunked prefill, quantized cache, etc.
  • Workload shape - are requests long, short, bursty, or streaming?
  • Fragmentation - fixed-size page allocation can leave some KV cache space unused (see the sketch below)
  • Decode behavior - token generation is memory-bandwidth bound, unlike compute-bound prefill, so reducing the load won't necessarily make individual responses faster

If you’re tuning for production, these second-order factors can shift your real limits by 10–30%.
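To make the fragmentation point concrete, here is a small sketch assuming block-based (paged) KV cache allocation with 16-token blocks - a common default, but the block size is engine-dependent:

```python
import math

block_size = 16      # tokens per KV cache block (engine-dependent assumption)
seq_len = 4097       # one token past a block boundary: the worst case for waste

blocks = math.ceil(seq_len / block_size)      # 257 blocks allocated
allocated = blocks * block_size               # 4,112 token slots reserved
wasted = allocated - seq_len                  # 15 slots sit unused in the last block
print(f"{wasted} slots wasted ({wasted / allocated:.1%} of this sequence's allocation)")
```

Per-sequence waste is bounded by block_size - 1 tokens, so with small blocks this particular effect is usually minor compared with the others listed above.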

Conclusion

If you’re self-hosting LLMs and need to hit concurrency or latency targets, it’s critical to move from intuition to calculation. With just a few inputs - model size, VRAM, context length - you can:

  • Estimate concurrency limits
  • Choose the right model precision
  • Plan upgrades and scaling strategies
  • Tune memory settings per engine

This is what determines whether you can serve 10 users - or 100.