
Model memory management

One of the most important things to consider when running a model is how much memory it will use. This is especially important when running large models on a GPU, where memory is limited. If you run out of memory, the model will crash and may need to be restarted manually, which can be very frustrating, especially if you have deployed the model to a server and are running it remotely. If you have the luxury of access to multiple GPUs, their memory can in effect be combined by taking advantage of multi-GPU deployment.

An interactive calculator is available in the GUI for estimating how much memory your model might require.

Factors affecting memory usage

Memory usage can be split into two categories: static and dynamic. Static memory usage is the amount of memory that is taken up by the model itself, and is constant throughout the model's lifetime. Dynamic memory usage is the amount of memory that is used by the model during inference, and can vary depending on the input.

Static memory usage

The static memory usage of a model is determined by the size of the model, which depends on two factors: the number of parameters and the precision of the model's weights.

  • Number of parameters: The total number of weights in the model (for example, 13B or 70B).
  • Precision: The number of bits used to store each weight. Using a lower precision will reduce the size of the model, but may also reduce the model's accuracy.
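
As a rough rule of thumb, the static footprint is simply the number of parameters multiplied by the number of bytes used per parameter. The sketch below illustrates this calculation; the 13B parameter count is an illustrative assumption, not a measurement of any particular model.

```python
# Static memory ~= number of parameters x bytes per parameter.
# The parameter count used below is an illustrative assumption.

BYTES_PER_PARAM = {"fp32": 4, "fp16": 2}

def static_memory_gb(num_params: float, precision: str) -> float:
    """Approximate memory needed just to hold the weights, in GB."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9

print(f"{static_memory_gb(13e9, 'fp32'):.0f} GB at fp32")  # ~52 GB
print(f"{static_memory_gb(13e9, 'fp16'):.0f} GB at fp16")  # ~26 GB
```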

Dynamic memory usage

The dynamic memory usage of a model is determined mainly by the size of the KV cache. This is the memory used to store the attention keys and values for tokens the model has already processed, so that they do not have to be recomputed for every new token. The KV cache size is determined by the model's architecture, the precision of the cached values, the sequence length, and the batch size.

To calculate the KV cache size, we use the following formula:

\mathrm{KV\ cache} = 2 \times 2 \times \mathrm{layers} \times \mathrm{hidden\ size} \times \mathrm{sequence\ length} \times \mathrm{batch\ size}
  • 2: One key and one value per key-value pair.
  • 2: Number of bytes per value in the 16-bit format usually used to store the cache.
  • Layers: Number of layers in the model.
  • Hidden size: Product of number of heads and head size.
  • Sequence length: Number of tokens in the input and output sequence.
  • Batch size: Number of requests to be processed in parallel.
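
Plugging representative numbers into this formula gives a sense of scale. The sketch below assumes a hypothetical 40-layer model with a hidden size of 5120 (roughly the shape of a 13B-parameter model); these figures are illustrative, not taken from any specific deployment.

```python
def kv_cache_bytes(layers: int, hidden_size: int,
                   sequence_length: int, batch_size: int,
                   bytes_per_value: int = 2) -> int:
    """KV cache = 2 (key + value) x bytes per value x layers
    x hidden size x sequence length x batch size."""
    return 2 * bytes_per_value * layers * hidden_size * sequence_length * batch_size

# Hypothetical 13B-class model: 40 layers, hidden size 5120, 16-bit cache.
size = kv_cache_bytes(layers=40, hidden_size=5120,
                      sequence_length=4096, batch_size=8)
print(f"{size / 1e9:.1f} GB")  # ~26.8 GB for 8 parallel 4096-token sequences
```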

Picking a model that fits

Smaller models vs Quantized models

If the model size exceeds the available memory, you typically have four choices.

  • Get more GPUs and then use multi-GPU deployment (expensive)
  • Get a GPU with more available memory (expensive)

And two more practical ones...

  • Use a smaller model (e.g. llama-13b instead of llama-70b)
  • Quantize the larger model.

Quantization

Quantization works by reducing the precision of individual weights to reduce the memory needed to store them. Weights are typically represented as 32-bit floating-point numbers, but can be converted into 4-bit integers. There is, naturally, a loss of information here, which in turn reduces model performance. The trick is to perform this quantization in such a way as to minimize the information loss.

Consensus suggests that for a fixed amount of memory, it is preferable to quantize a larger model rather than use a smaller one at full precision. This effect diminishes past a certain point of quantization, with 4-bit being the sweet spot for AWQ models.

Compression without performance loss

Using 4-bit quantization leads to an 8x compression over standard 32-bit models. With a technique called Activation-aware Weight Quantization (AWQ), models can be compressed aggressively without much loss in model performance. This technique safeguards critical model weights from quantization errors, enabling speedy inference with heavily compressed models.
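
To put the compression ratio in concrete terms, the sketch below works through the weight storage of a hypothetical 70B-parameter model at decreasing precisions; the parameter count is an illustrative assumption, and the small overhead quantization adds for scales and zero points is ignored.

```python
# Approximate weight storage at different precisions, showing the roughly
# 8x compression from 32-bit floats down to 4-bit integers.
# The 70B parameter count is an illustrative assumption, and the small
# overhead of quantization scales/zero points is ignored.

def weights_gb(num_params: float, bits_per_weight: int) -> float:
    return num_params * bits_per_weight / 8 / 1e9

for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit: {weights_gb(70e9, bits):.0f} GB")

# 32-bit: 280 GB
# 16-bit: 140 GB
#  8-bit:  70 GB
#  4-bit:  35 GB  (8x smaller than 32-bit)
```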

Read more about our use of quantization here.

Using AWQ models

A range of popular models have been converted to AWQ and made available here. Get in contact if you wish to quantize a different model, as we can usually help out with this.