
LLMs in Production: Deploying the TitanML Takeoff Server on Kubernetes

Fergus Finn · 12 min read

Large Language Models (LLMs) are a transformative new technology with great potential to change the way we build software. They generate text, answer questions, and write code. However, deploying these models remains challenging because of their size and the substantial compute resources they require. This post focuses on using two infrastructure tools, Docker and Kubernetes, to deploy Titan Takeoff, a Docker image that bundles optimization and serving technology designed specifically for LLMs. We're following on from our primer, where we introduce Docker and Kubernetes and explain how they can be used to deploy machine learning models.

For all the code used in this post, see our GitHub repository.

Docker and Titan Takeoff​

Introducing Docker and Titan Takeoff​

As we discussed in our previous post, Docker is a software package that helps us build, run, and manage containers. It simplifies the process of bundling up processes and running them in a way that's isolated from the rest of the system.

Titan Takeoff is a Docker image that comes pre-packaged with large language model (LLM) compression and optimization technology, combined with a low-latency inference server. It is designed to allow you to quickly and easily deploy a large language model in a container.

Getting Started with Titan Takeoff​

To run Titan Takeoff locally, you will need to install Docker. If you have a GPU on your machine, you should also install the NVIDIA Container Toolkit, which allows containers running on your machine to use your GPU. Once Docker is set up, you can run Titan Takeoff with the following command:

docker run \
  -e TAKEOFF_MODEL_NAME='google/flan-t5-small' \
  -p 8080:80 \
  tytn/fabulinus:latest-cpu

In the above command, we set the TAKEOFF_MODEL_NAME environment variable to the name of the model we want to deploy. This instructs Titan Takeoff to download and optimize this model when the container starts. The -p 8080:80 flag maps port 80 inside the container to port 8080 on the host machine, allowing us to access the inference server.
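If you have a GPU and the NVIDIA Container Toolkit installed, you can run the GPU build of the image instead. Here's a minimal sketch, assuming (as the Helm template later in this post suggests) that the latest tag without the -cpu suffix is the GPU build, and using Docker's --gpus flag to expose the GPU to the container:

docker run \
  --gpus all \
  -e TAKEOFF_MODEL_NAME='google/flan-t5-small' \
  -p 8080:80 \
  tytn/fabulinus:latest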

Calling the server​

Once the container has booted, you can call the server with curl as follows:

curl -X POST \
  -H "Content-Type: application/json" \
  -N \
  -d '{"text": "Hello world!"}' \
  http://localhost:8080/generate_stream

Deploying Takeoff with Docker directly is somewhat unnecessary, since the iris launcher comes packaged with all the tools required to launch and manage these Docker containers. Understanding how to use the container directly becomes useful when you want to deploy it more robustly, and that's what we'll cover in the next section, where we discuss Kubernetes.

Deploying Titan Takeoff with Kubernetes​

Setting Up Kubernetes with MicroK8s​

Kubernetes is a container orchestrator: it manages the deployment and scaling of containers across a cluster of machines. Those machines can have heterogeneous resources: GPUs, CPUs, RAM, etc. Kubernetes makes horizontal and vertical scaling of containers easy, and allows us to deploy containers in a way that is resilient to failure. Kubernetes comes with autoscaling built in, so we can scale our deployment up and down automatically based on demand. It also provides a number of other useful features, such as load balancing and service discovery.

In order to try Kubernetes out locally, we need to set up a Kubernetes cluster on our machine. In production, we'd recommend a managed Kubernetes service such as Google Kubernetes Engine or Amazon Elastic Kubernetes Service. Such a solution would give us a cluster of machines with Kubernetes pre-installed, and would allow us to easily scale our cluster up and down as needed.

Developing locally against such a deployment can be a pain. One of the easiest ways to run Kubernetes locally is MicroK8s, a lightweight Kubernetes distribution designed for local deployment and testing. You can find instructions for installing MicroK8s here.

Other options

There are a number of other options for running Kubernetes locally, including Minikube, Docker Desktop and kind. We're using microk8s here, since it's the most painless option for enabling GPU support on a local Kubernetes cluster.

GPU support in microk8s is bundled in an addon called gpu, which you can install with the following command:

microk8s enable gpu
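To confirm that the cluster is ready, and that the GPU is visible to Kubernetes once the addon has finished installing, you can wait on MicroK8s and then inspect the node's resources. A quick check along these lines should work (the nvidia.com/gpu resource name assumes the addon's NVIDIA device plugin):

microk8s status --wait-ready
microk8s kubectl describe node | grep -i nvidia.com/gpu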

Kubernetes Deployment Spec​

Once MicroK8s is installed, we can write our Kubernetes deployment specification. This is a YAML file that tells Kubernetes what we want to deploy, and how we want it to be managed. Here's a simple example for our Titan Takeoff container:

Kubernetes resources

Kubernetes has a number of different resource types. For an overview of resource types and how they interact with one another, see our primer.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: titan-takeoff
spec:
  replicas: 1
  selector:
    matchLabels:
      app: titan-takeoff
  template:
    metadata:
      labels:
        app: titan-takeoff
    spec:
      containers:
        - name: titan-takeoff
          image: tytn/fabulinus:latest-cpu
          env:
            - name: TAKEOFF_MODEL_NAME
              value: "google/flan-t5-small"
          ports:
            - containerPort: 80

This deployment spec tells Kubernetes to create a deployment named titan-takeoff with one replica (i.e., one pod). Each pod runs our Titan Takeoff image. If we have a GPU available, we specify that the pod should have access to one GPU in the resources section (a GPU variant is sketched below); if we don't, we use the -cpu image to speed up the container download, since it doesn't have to package the GPU dependencies. The spec also sets the TAKEOFF_MODEL_NAME environment variable, just like in our Docker command earlier.
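For reference, here is a sketch of what the containers section looks like in the GPU case: it swaps the -cpu image for the plain latest tag (assumed to be the GPU build, as in the Helm template later on) and requests one GPU via the nvidia.com/gpu resource exposed by the gpu addon's device plugin:

      containers:
        - name: titan-takeoff
          image: tytn/fabulinus:latest
          env:
            - name: TAKEOFF_MODEL_NAME
              value: "google/flan-t5-small"
          ports:
            - containerPort: 80
          resources:
            limits:
              nvidia.com/gpu: 1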

Create the file, and then apply it to your Kubernetes cluster using the following command:

kubectl apply -f deployment-cpu.yaml

If you run kubectl get pods, you should see a pod whose name starts with titan-takeoff- running on your cluster.

Useful Commands
  • To get the logs for the pod, run microk8s kubectl logs titan-takeoff-....
  • To delete the pod, run microk8s kubectl delete pod titan-takeoff-... (though, since we used a deployment, the deployment controller will immediately bring up a replacement).
  • To delete the deployment, run microk8s kubectl delete deployment titan-takeoff.
  • To get information about the pod, run microk8s kubectl describe pod titan-takeoff-....

If you're not using MicroK8s, or you have installed and authenticated the kubectl command-line tool directly, omit the microk8s prefix from the above commands.

Talking to Takeoff: Creating a Service​

Services give a set of pods a consistent network identity. This is a stable endpoint that forwards traffic to one or more pods. Here's a simple example of a service for our deployment:

apiVersion: v1
kind: Service
metadata:
  name: titan-takeoff-service
spec:
  selector:
    app: titan-takeoff
  ports:
    - protocol: TCP
      port: 8080
      targetPort: 80
  type: ClusterIP

This service forwards traffic from port 8080 on the service to port 80 on any pods that match the app: titan-takeoff label. The service will distribute requests that it receives among all the matching pods, providing a simple form of load balancing. This means that we can trivially horizontally scale our deployment by increasing the number of replicas in our deployment spec, while keeping the service the same.
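For example, to run three replicas behind the same service, you can either edit the replicas field and re-apply the spec, or use kubectl scale directly:

kubectl scale deployment titan-takeoff --replicas=3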

Now that we have a service deployed, we can send requests to our Titan Takeoff container.1 Try it out by running the following command:

kubectl port-forward service/titan-takeoff-service 8080:8080

This will forward traffic from port 8080 on your local machine to port 8080 on the service. Then, run the following command in a separate terminal:

curl -X POST \
  -H "Content-Type: application/json" \
  -N \
  -d '{"text": "Hello world!"}' \
  http://localhost:8080/generate_stream

You should see your model's answer to the prompt "Hello world!" stream back to your terminal.

Rolling Out a New Version​

To roll out a new version of our model, we simply change the TAKEOFF_MODEL_NAME environment variable in our deployment spec, and apply the updated spec using

kubectl apply -f deployment-cpu.yaml

Kubernetes will then automatically update our pods to reflect the change. For example, if we wanted to change the model to 'tiiuae/falcon-7b-instruct', we would change the value field under TAKEOFF_MODEL_NAME to this new model name, save our deployment spec, and apply it. The same applies to any other changes we want to make to our deployment spec. To scale up the number of replicas, we simply change the replicas field in our deployment spec, and apply it again.
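Two standard kubectl commands are handy here (nothing Takeoff-specific): kubectl set env is a quick alternative to editing the YAML by hand, and kubectl rollout status lets you watch the new pods come up:

kubectl set env deployment/titan-takeoff TAKEOFF_MODEL_NAME=tiiuae/falcon-7b-instruct
kubectl rollout status deployment/titan-takeoff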

Packaging it up: Helm Charts (Optional)​

We've now got a working deployment of our Titan Takeoff container, but we can make it even easier to deploy by packaging it up as a Helm chart. Helm is a package manager for Kubernetes that allows us to define a set of Kubernetes resources and then install them with a single command. This is especially useful for deploying complex applications that require multiple Kubernetes resources to be deployed together. We can also use its powerful templating abilities to make our chart configurable, so that we can easily deploy multiple instances of our Titan Takeoff container with different models or different numbers of replicas.

Let's start by creating a new directory for our chart:

mkdir titan-takeoff-chart
cd titan-takeoff-chart

Now, create a file called Chart.yaml in this directory, and add the following contents:

Chart.yaml
apiVersion: v2
name: titan-takeoff
description: A Helm chart for deploying Titan Takeoff
type: application
version: 0.1.0
appVersion: 0.1.0

This file contains some basic metadata about our chart, including its name, description, and version. Next, we need to define the Kubernetes resources that make up our chart. Create a new directory called templates, and add a deployment.yaml file to it with the following contents:

templates/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: titan-takeoff
spec:
  replicas: 1
  selector:
    matchLabels:
      app: titan-takeoff
  template:
    metadata:
      labels:
        app: titan-takeoff
    spec:
      containers:
        - name: titan-takeoff
          image: {{ if .Values.gpu }}tytn/fabulinus:latest{{ else }}tytn/fabulinus:latest-cpu{{ end }}
          ports:
            - containerPort: 80
          env:
            - name: TAKEOFF_MODEL_NAME
              value: {{ .Values.modelName }}
          {{- if .Values.gpu }}
          resources:
            limits:
              nvidia.com/gpu: 1
          {{- end }}
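The chart also needs the Service from earlier, so that other workloads (and kubectl port-forward) can reach the pods. A minimal templates/service.yaml, carried over essentially unchanged from the spec we wrote above:

templates/service.yaml
apiVersion: v1
kind: Service
metadata:
  name: titan-takeoff-service
spec:
  selector:
    app: titan-takeoff
  ports:
    - protocol: TCP
      port: 8080
      targetPort: 80
  type: ClusterIP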

Helm lets us separate our configuration from our Kubernetes resources by using a templating language based on Go templates. The {{ .Values.* }} references above are filled in from the chart's values, which are supplied either on the command line or in a values.yaml file, as sketched below.
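Here is a minimal values.yaml sketch covering the values used above, plus the persistence settings used in the optional section later in this post; the defaults are assumptions chosen to match the templates:

values.yaml
modelName: google/flan-t5-small
gpu: false
persistence: false
cacheSize: 10Gi   # assumed default; size this to fit your model weights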

Installing the Chart​

Let's start by cleaning up our existing deployment and service (this assumes you saved the earlier specs as deployment-cpu.yaml and service.yaml):

kubectl delete -f deployment-cpu.yaml
kubectl delete -f service.yaml

Now, to redeploy with our new chart, run the following command:

helm install titan-takeoff .

To install the chart with a different model, we can override the modelName by setting the value on the command line:

helm install titan-takeoff . --set modelName=google/flan-t5-small
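Subsequent changes follow the same pattern with helm upgrade, and the whole release can be removed with helm uninstall:

helm upgrade titan-takeoff . --set modelName=tiiuae/falcon-7b-instruct
helm uninstall titan-takeoff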

Adding persistence: PVCs (Optional)​

To make sure that downloaded models persist across container restarts, we can use persistent volumes. Modify the Helm chart to add a PersistentVolumeClaim (PVC) for the model cache, use a StatefulSet in place of the Deployment, and make persistence configurable in values.yaml:

templates/statefulset.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: titan-takeoff
spec:
  replicas: 1
  serviceName: titan-takeoff
  selector:
    matchLabels:
      app: titan-takeoff
  template:
    metadata:
      labels:
        app: titan-takeoff
    spec:
      containers:
        - name: titan-takeoff
          image: {{ if .Values.gpu }}tytn/fabulinus:latest{{ else }}tytn/fabulinus:latest-cpu{{ end }}
          ports:
            - containerPort: 80
          env:
            - name: TAKEOFF_MODEL_NAME
              value: {{ .Values.modelName }}
          {{- if .Values.gpu }}
          resources:
            limits:
              nvidia.com/gpu: 1
          {{- end }}
          {{- if .Values.persistence }}
          volumeMounts:
            - name: model-cache
              mountPath: /.takeoff_cache
          {{- end }}
  {{- if .Values.persistence }}
  volumeClaimTemplates:
    - metadata:
        name: model-cache
      spec:
        accessModes: [ "ReadWriteOnce" ]
        resources:
          requests:
            storage: {{ .Values.cacheSize }}
  {{- end }}
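With the templates in place, persistence can be switched on at install or upgrade time; for example (value names as in the values.yaml sketch above):

helm upgrade --install titan-takeoff . --set persistence=true --set cacheSize=10Gi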

Conclusions​

And that's it! We now have a deployed Titan Takeoff container, managed by Kubernetes, ready to handle inference requests for our model. Using Docker and Kubernetes, we've made the complex task of deploying a large language model manageable, repeatable, and scalable. We're excited to see what you build with Titan Takeoff!

If you have any questions, comments, or feedback on the Titan Takeoff server, please reach out to us on our Discord server. For help with LLM deployment in general, or to sign up for the pro version of the Titan Takeoff Inference Server, with features like automatic batching, multi-GPU inference, monitoring, authorization, and more, please reach out at hello@titanml.co.

For all the code used in this post, see our GitHub repository.

Footnotes​

  1. We could have sent requests directly to the pods, but the service provides a natural interface for load balancing and scaling.