Skip to main content

Inference Optimization: Why GPUs for machine learning?

· 6 min read
Fergus Finn

NVIDIA's stock price recently hit record levels1, on an earnings report that showed their data center sales had gone through the roof. Those datacenter units were sold to companies trying to produce AI enabled applications. But why has AI led to this rush to buy GPUs? Why Graphics Processing Units? The answer lies in their potential for parallelising machine learning workloads by dividing them up and allowing multiple operations to be undertaken simultaneously.

Large Language Models​

We can answer this question by looking to language models, which can be thought of as a sophisticated tool designed to work with text. To illustrate, consider the autoregressive language models, whose primary task is to read a piece of text and predict the most fitting continuation.

Click the button below to see an example of an autoregressive language model in action2.

example
The quick brown fox

In order to achieve this, the language model will need to convert the input text into a list of numbers called a vector, that stores a information about a word. This process is called tokenization and is important as computers don't understand language the way humans do and can't intuitively know the meaning or sentiment of a word.

However, they are excellent at handling numbers. So, by representing words as vectors that encapsulate textual information such as semantic meaning, similarities with other words, contextual information, grammatical properties, we can feed this information into machine learning models that can then process, analyse, or even generate language.

What is Matrix Multiplication?​

At the heart of machine learning lies an operation called matrix multiplication, which underpin many of the key operations used in machine learning. Matrix multiplication is the process of taking a grid of numbers called a matrix and using it to transform the input text vector from one vector to another.

A:1234B:56=C:Result:

This transformation turns one representation of our input text into another, rotating and skewing it in space until it looks completely different. By transforming the input text in this way (interspersed with simple nonlinear transformations), we can capture the process of generating new text from old, by viewing it as a complicated transformation in a high-dimensional space.

When it comes to the forward operation of a machine learning model, the most resource-intensive step is computing the results of matrix multiplications[2]. This is where the role of GPUs becomes pivotal. Now, it's important to understand that matrix multiplications have a unique characteristic: they're inherently parallelisable.

In the example above, clicking the "Next Step" button only calculates a single element of the output vector. Yet, each single calculation isn't dependent on the other. This means, if we have N computing units available, we could potentially compute N elements simultaneously, leading to a significant boost in the model's operational speed.

Here's where the difference between CPUs and GPUs becomes evident. CPUs are primarily designed to execute a limited set of operations at lightning speed, making them unsuitable for such parallel tasks. GPUs, however, are specifically engineered for these extensive parallel workloads, making them indispensable in the realm of machine learning. Thus, the solution to the NVIDIA mystery.

GPU Types - Which ones to get?​

Why choose NVIDIA when there are numerous GPU providers out there? The consistent preference for NVIDIA in the machine learning arena can be attributed to its software. NVIDIA's CUDA software stack stands out as the most mature and widely-adopted platform. Notably, it seamlessly integrates with modern deep learning libraries like PyTorch, JAX, and Tensorflow. Programming with CUDA is straightforward, and the powerful abstraction layers built atop it make the process even more efficient.

NVIDIA manufactures two distinct types of GPUs: those designed for consumers and those tailored for data centers. The most recent and advanced consumer GPU series for deep learning is the RTX 40xx. On the other hand, NVIDIA's datacenter GPUs, which are available through cloud providers, represent a pricier yet significantly more potent option.

The A100, for exmple, is a previous generation datacenter GPU that was foundational in the training and inference of Large Language Models. The latest generation, the H100, is even more powerful. If you are looking for a comprehensive analysis on which consumer GPU to invest in for machine learning development, you can read more about it here.

Why isn't my model running on my GPU?​

The most common and most dreaded experience people have when working with deep learning on GPUs is the Out Of Memory (OOM) error. This occurs when the model that you're trying to work with is too large for the memory on your GPU.

So what are your options when you get an OOM error? To most people, the most straightforward option is to procure a better GPU or rent one from a cloud provider, but this is often costly and unneccessary. The more sustainable alternative is to optimise your model.

This refers to the process of making your model smaller, faster, and more efficient. There are many different inference optimisation techniques that we use to bring you the best performance on our Titan Takeoff Server. As this is a huge topic, and we'll be writing more about it in the future, so do stay tuned!

Conclusions​

In this post, we've seen how GPUs are the best option for machine learning workloads. We've talked about what GPUs are available, and how to choose between them. Finally, we've talked about the importance of inference optimization, to make sure that your model is running as efficiently as possible.

Footnotes​

  1. https://edition.cnn.com/2023/08/23/tech/nvidia-earnings-ai-chips/index.html ↩

  2. Simplified: in practise, the generated fragments don't correspond to words, but instead text fragments, called tokens: this process is called tokenization. For an example of how words are broken down, see openAI's tokenization demo. ↩