LLMs in Production: Docker and Kubernetes for Machine Learning

Fergus Finn · 4 min read

Large language models are shaping our world in ways we never anticipated. They generate text, answer questions, and even write code. Their power to transform how we live and work is profound. Deploying these behemoths, however, is a challenge: they're big, they demand significant compute resources to run, and MLOps, the field that applies DevOps practices to machine learning workflows, is complex and still maturing.

In this blog post, we're going to introduce a crucial building block of modern MLOps - the container - and dive into a popular container orchestrator called Kubernetes. Let's start our journey into this exciting world.

Containers: What Are They?

Understanding Containers

Containers solve the problem of getting software to run reliably when it moves from one computing environment to another. A container is a lightweight way of bundling up one or more processes on your computer and isolating them from the rest of the system. What makes containers even more appealing is the ability to set hard limits on the CPU and memory available to those processes. These isolated processes, together with their filesystem, can be packaged into a single object that runs on any computer with a container runtime installed, making it portable.
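For instance, with Docker (which we introduce below) you can impose those hard limits when launching a container. This is a minimal sketch, and the python:3.11-slim image is just an example:

```bash
# Start a container capped at 1.5 CPU cores and 2 GiB of memory.
# The process inside is isolated from the rest of the system.
docker run --cpus="1.5" --memory="2g" python:3.11-slim \
  python -c "print('isolated, resource-capped process')"
```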

The Appeal of Containers for ML

Deploying ML models and keeping them running can be quite a task; containers make it simpler. Machine learning is infamous for dependency hell: getting your ML framework to cooperate with your hardware, your packages, and your operating system often becomes a nightmare. Containers largely mitigate the problem by bundling all of these dependencies into a single reproducible object. Reproducibility is the critical property: if our packaged ML container runs on our machine, we can be far more confident that it will also run on our deployment infrastructure.
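To make that concrete, here is a minimal sketch of a Dockerfile that bundles an ML service and its dependencies into one image; the file names requirements.txt and serve.py are hypothetical stand-ins for your own project:

```dockerfile
# Pin the base OS and Python version for reproducibility.
FROM python:3.11-slim

WORKDIR /app

# Bake the ML dependencies into the image itself.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Add the application code and define how the container starts.
COPY serve.py .
CMD ["python", "serve.py"]
```

Because the framework, the packages, and the OS layer are all captured in the image, the same artifact runs identically on a laptop and on your deployment infrastructure.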

Docker: A Container Manager

Docker is a platform for developing, shipping, and running applications with the help of containerization. It comprises several components, two of which matter most for our purposes: the container runtime and the command-line interface. Docker uses these to build, run, and manage containers.
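As a rough sketch of that workflow (the image name llm-server:0.1 is a placeholder), the command-line interface covers the whole lifecycle:

```bash
# Build an image from the Dockerfile in the current directory.
docker build -t llm-server:0.1 .

# Run it as a container, publishing port 8000 to the host.
docker run -d -p 8000:8000 --name llm-server llm-server:0.1

# Manage it: list running containers, tail the logs, stop it.
docker ps
docker logs llm-server
docker stop llm-server
```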

Kubernetes: The Container's Operating System

To understand Kubernetes, we first need to identify the problems it solves. Consider an operating system (OS). Its primary role is to manage a computer's resources and keep processes running smoothly. The OS takes the programs written by programmers (sets of instructions) and runs them as processes. And it's not just about running processes: it must also ensure they don't interfere with each other, and manage their execution concurrently, providing the illusion of parallelism¹.

Kubernetes plays a similar role, but for containers. Here's why it's vital:

  1. It ensures better isolation for workloads: processes can be leaky and interfere with one another. Containers isolate them, and Kubernetes manages those isolated workloads.
  2. It enables portability: unlike processes, which are tied to the OS they were built for, containers can run on different machines without recompilation. This benefits developers and users alike.
  3. It helps manage heterogeneity in ML systems: ML tasks range from CPU-intensive work to jobs needing GPUs, large amounts of memory, or lots of storage, and the optimal machine for each kind of task varies. Kubernetes, combined with cloud computing, can manage this heterogeneity, assigning a task to a machine with a GPU only when one is required (see the sketch after this list).
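As an illustration of point 3, here is a minimal sketch of a Kubernetes Pod manifest that will only be scheduled onto a node with a free GPU. The image name is a placeholder, and the nvidia.com/gpu resource assumes the cluster runs the NVIDIA device plugin:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: llm-server
spec:
  containers:
    - name: llm-server
      image: my-registry/llm-server:0.1   # placeholder image
      resources:
        requests:
          cpu: "4"
          memory: "16Gi"
        limits:
          nvidia.com/gpu: 1   # lands the Pod on a GPU node only
```

A CPU-only task would simply omit the GPU limit, leaving the scheduler free to place it on a cheaper machine.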

Containerization and ML Deployment: Titan Takeoff

In this article, we've discussed why containers and container orchestration matter for ML deployment. In our next post, we will introduce TitanML Takeoff, a Docker image that packages LLM compression and optimization technology with a low-latency inference server. It aims to simplify deploying a large language model in a container. We'll discuss how to use it, and how to deploy it on Kubernetes.

Footnotes

  1. In practice, modern PCs do run multiple processes in parallel across their cores, but far fewer than the total number of processes alive at any one time: hence the need for concurrency. ↩