
Behind the Stack, Ep 3: How to Serve 100 Models on a Single GPU with No Cold Starts

4 min read
Jamie Dborin
Founder & Member of Technical Staff, Doubleword

Introduction

In many orgs, self-hosting LLMs starts with a single model. Then comes a customisation request. Then another. And before long, you’ve got dozens of fine-tuned variants - each trained with LoRA or another parameter-efficient technique.

Training these models is relatively lightweight. Serving them efficiently is a much harder problem. In this video, I break down how to serve many LoRAs (or other PEFTs) on a single GPU, support dynamic load patterns, and avoid the high cost and latency of traditional serverless setups.
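
To make that end state concrete, here is a minimal sketch of multi-adapter serving with vLLM, which keeps one copy of the base model on the GPU and applies a different LoRA per request. The base-model name, adapter names, and adapter paths are placeholders, not values from the episode.

```python
# Minimal multi-LoRA serving sketch with vLLM. Model and adapter paths are
# illustrative placeholders; substitute values from your own deployment.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# One copy of the base model on the GPU, with LoRA support enabled.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # assumed base model
    enable_lora=True,
    max_loras=8,        # adapters kept resident on the GPU at once
    max_lora_rank=16,
)

params = SamplingParams(max_tokens=128)

# Each request can name a different adapter; requests for different adapters
# are batched together against the same shared base weights.
support_out = llm.generate(
    "Summarise this support ticket: ...",
    params,
    lora_request=LoRARequest("support-adapter", 1, "/adapters/support"),
)
sales_out = llm.generate(
    "Draft a follow-up email for this lead: ...",
    params,
    lora_request=LoRARequest("sales-adapter", 2, "/adapters/sales"),
)
```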

What Is a LoRA (and Why Use One)?

LoRA (Low-Rank Adaptation) is a popular form of parameter-efficient fine-tuning. Instead of updating full weight matrices, LoRA inserts small trainable adapters at key layers.

  • Only a small fraction of parameters are updated
  • Training uses much less memory
  • The resulting adapters are tiny (often <1% of the model size)
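
A quick back-of-envelope calculation shows where that "<1%" figure comes from. The matrix size and rank below are illustrative assumptions, roughly one attention projection in a mid-sized model:

```python
# Parameters trained for one adapted weight matrix (illustrative sizes).
d_in, d_out = 4096, 4096      # a single attention/MLP projection
rank = 16                     # a typical LoRA rank

full_params = d_in * d_out            # what full fine-tuning would update
lora_params = rank * (d_in + d_out)   # A is (rank x d_in), B is (d_out x rank)

print(f"{full_params:,}")                   # 16,777,216
print(f"{lora_params:,}")                   # 131,072
print(f"{lora_params / full_params:.2%}")   # 0.78% of the original matrix
```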

These benefits make LoRA a go-to method for use cases where you want to:

  • Customize a base model per task or domain
  • Run many fine-tunes without retraining or duplicating the base model
  • Stay compatible with quantized or frozen weights

At inference time, a LoRA can either be merged into the base model’s weights (for zero added overhead) or kept separate so you can swap between fine-tunes on the fly.
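
As a rough sketch of those two options using Hugging Face PEFT - the base-model name and adapter paths are placeholders, and in practice you would follow one path or the other:

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Option 1: merge the adapter into the base weights. Zero added inference
# overhead, but the merged model now serves only that one fine-tune.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
merged = PeftModel.from_pretrained(base, "adapters/task-a").merge_and_unload()

# Option 2: keep adapters separate and swap them at inference time, sharing a
# single frozen copy of the base model across many fine-tunes.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
model = PeftModel.from_pretrained(base, "adapters/task-a", adapter_name="task_a")
model.load_adapter("adapters/task-b", adapter_name="task_b")
model.set_adapter("task_b")   # the next forward pass applies task_b's adapter
```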

Behind the Stack, Ep 2: How Many Users Can My GPU Serve?

4 min read
Jamie Dborin
Founder & Member of Technical Staff, Doubleword

Introduction

When self-hosting LLMs and productionising AI, one of the first practical questions you’ll run into is: “How many users can this system actually support?”

It’s a question rooted in system design, not just intuition. While it's easy to watch GPU utilization or rely on batch size as a proxy, neither gives you a reliable measure of how far your hardware can actually stretch under real-world loads.

In this video, we break down the calculation that gives you a usable estimate of your system's capacity - grounded in memory constraints and model architecture. With just a few known quantities (model config, token usage, GPU memory), you can forecast how many users your setup can realistically support - and what it would take to grow that number.
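
As a flavour of that calculation, here is a hedged back-of-envelope version: take the GPU memory left over after loading the weights as your KV-cache budget, then divide by the cache each active user needs. Every number below is an illustrative assumption, not a figure from the episode.

```python
# Back-of-envelope capacity estimate; all values are illustrative assumptions.
gpu_mem_gb       = 80        # e.g. a single 80 GB A100/H100
model_params     = 8e9       # 8B-parameter model
bytes_per_param  = 2         # fp16/bf16 weights

num_layers       = 32        # from the model config
num_kv_heads     = 8         # grouped-query attention
head_dim         = 128
kv_bytes_per_tok = 2 * num_layers * num_kv_heads * head_dim * 2  # K and V, fp16

tokens_per_user  = 4096      # prompt + generation held in cache per active user

weights_gb   = model_params * bytes_per_param / 1e9
kv_budget_gb = gpu_mem_gb - weights_gb            # memory left for the KV cache
kv_per_user  = tokens_per_user * kv_bytes_per_tok / 1e9

print(f"Weights: {weights_gb:.0f} GB, KV budget: {kv_budget_gb:.0f} GB")
print(f"~{int(kv_budget_gb / kv_per_user)} concurrent users")  # ~119 with these numbers
```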

Behind the Stack, Ep 1: What Should I Be Observing in my LLM Stack?

3 min read
Jamie Dborin
Founder & Member of Technical Staff, Doubleword

Introduction

It’s easy to default to GPU or CPU utilization to assess LLM system load - but that’s a trap. These metrics were built for traditional compute workloads and fall short in LLM deployments. They can stay flat while your model silently hits capacity, leading to missed scaling signals and degraded performance.