Model Memory Management
One of the most important things to consider when running a model is how much memory it will use. This is especially important when running large models on a GPU, where memory is limited. If the model runs out of memory, it will crash and may need to be restarted manually. This can be very frustrating, especially if you have deployed the model to a server and are running it remotely. If you have access to multiple GPUs, their memory can in effect be pooled by taking advantage of Multi-GPU Deployment.
An interactive calculator is available on the GUI for calculating how much memory your model might require.
Factors affecting memory usage
Memory usage can be split into two categories: static and dynamic. Static memory usage is the amount of memory that is taken up by the model itself, and is constant throughout the model's lifetime. Dynamic memory usage is the amount of memory that is used by the model during inference, and can vary depending on the input.
Static memory usage
The static memory usage of a model is determined by the size of the model, which depends on two factors: the number of parameters and the precision of the model's weights.
- Number of parameters: This is the total count of weights in the model. Memory usage scales linearly with it, so a model with twice as many parameters needs twice as much memory at the same precision.
- Precision: This is the number of bits used to store each weight. Using a lower precision will reduce the size of the model, but may also reduce the model's accuracy.
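As a rough illustration of how these two factors combine, the sketch below multiplies the parameter count by the number of bytes each weight occupies. The function name, the precision labels, and the 7-billion-parameter example are assumptions made for illustration; they are not taken from the calculator in the GUI, and real deployments need some additional memory on top of the weights.

```python
# Bytes needed to store one weight at each precision.
# The labels are illustrative; int4 packs two weights into one byte.
BYTES_PER_PARAM = {
    "float32": 4.0,
    "float16": 2.0,
    "bfloat16": 2.0,
    "int8": 1.0,
    "int4": 0.5,
}


def static_memory_gib(num_params: int, precision: str) -> float:
    """Estimate the memory taken by the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[precision] / 1024**3


# Hypothetical model with 7 billion parameters.
for precision in ("float32", "float16", "int8", "int4"):
    size = static_memory_gib(7_000_000_000, precision)
    print(f"{precision}: {size:.1f} GiB")
```

Halving the precision halves the estimate, which is why lower-precision weights are a common way to fit a model onto a smaller GPU.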
Dynamic memory usage
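The dynamic memory usage of a model is determined mainly by the size of the KV cache. The KV cache stores the attention keys and values computed for tokens that have already been processed, so that they do not need to be recomputed for each newly generated token. Its size is determined by the architecture of the model (such as the number of layers and attention heads), the precision in which the cache is stored, and the number of tokens held in the cache, which grows with sequence length and batch size.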
To calculate the KV cache size, we use the following formula: