Behind the Stack, Ep 3: How to Serve 100 Models on a Single GPU with No Cold Starts
Introduction
In many orgs, self-hosting LLMs starts with a single model. Then comes a customization request. Then another. And before long, you’ve got dozens of fine-tuned variants - each trained with a LoRA or another parameter-efficient technique.
Training these models is relatively lightweight. Serving them efficiently is a much harder problem. In this video, I break down how to serve many LoRAs (or other PEFT adapters) on a single GPU, support dynamic load patterns, and avoid the high cost and latency of traditional serverless setups.
What Is a LoRA (and Why Use One)?
LoRA (Low-Rank Adaptation) is a popular form of parameter-efficient fine-tuning. Instead of updating full weight matrices, LoRA inserts small trainable low-rank adapters at key layers. This has a few practical consequences:
- Only a small fraction of parameters are updated
- Training uses much less memory
- The resulting adapters are tiny (often <1% of the model size)
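To make the adapter idea concrete, here is a minimal PyTorch sketch of a LoRA-wrapped linear layer: the base weight stays frozen, and only the two low-rank matrices A and B are trained. The class name, rank, and alpha values are illustrative, not taken from any particular library.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen nn.Linear with a trainable low-rank update (illustrative sketch)."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)          # the base weights stay frozen
        # The adapter is just two small matrices: A (d_in x r) and B (r x d_out)
        self.lora_A = nn.Parameter(torch.randn(base.in_features, rank) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(rank, base.out_features))
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # output = frozen projection + scaled low-rank correction
        return self.base(x) + (x @ self.lora_A @ self.lora_B) * self.scaling

# Example: a 4096x4096 projection at rank 8
layer = LoRALinear(nn.Linear(4096, 4096), rank=8)
y = layer(torch.randn(2, 4096))
```

For this 4096×4096 projection at rank 8, the adapter adds about 65K trainable parameters against roughly 16.8M in the frozen matrix - around 0.4%, which is where the "<1% of the model size" figure comes from.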
These benefits make LoRA a go-to method for use cases where you want to:
- Customize a base model per task or domain
- Run many fine-tunes without retraining or duplicating the base model
- Stay compatible with quantized or frozen weights
At inference time, a LoRA can either be merged into the base weights (for zero added overhead) or kept separate so you can swap between fine-tunes at runtime.
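As a rough illustration of those two paths, the sketch below uses the Hugging Face PEFT library - one common way to handle LoRA adapters, not necessarily what your serving stack uses - and the model and adapter IDs are placeholders.

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("my-org/base-model")   # placeholder ID

# Keep adapters separate: load them on top of the frozen base and swap per request.
model = PeftModel.from_pretrained(base, "my-org/adapter-task-a", adapter_name="task_a")
model.load_adapter("my-org/adapter-task-b", adapter_name="task_b")
model.set_adapter("task_b")        # later requests can switch back with set_adapter("task_a")

# Or merge a single adapter into the base weights for zero added overhead;
# the result is a plain model locked to that one fine-tune.
merged = model.merge_and_unload()  # folds the active adapter into the base weights
```

Merging is the right call when one fine-tune owns the GPU; keeping adapters separate is what makes serving many fine-tunes from a single base model possible, which is the setup the rest of this episode builds on.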
