Building a Kubernetes Deployment with Takeoff from Scratch
This guide does not include features like scaling, monitoring, continuous deployment, resilience, load balancing and so on. It is a guide to how you could start to build a cluster from scratch with Takeoff Engine. If you're interested in deploying large language models at scale, we recommend you contact us, and we can advise you further.
Large Language Models (LLMs) are a new technology with great potential to transform the way that we build software. They generate text, answer questions, and write code. However, deploying these models remains challenging due to their size and the substantial compute resources they require. This post focuses on using two infrastructure tools, Docker and Kubernetes, to deploy Takeoff, a Docker image that bundles optimization and serving technology specifically designed for LLMs. It follows on from our primer, where we give an introduction to Docker and Kubernetes and explain how they can be used to deploy machine learning models.
For all the code used in this post, see our GitHub repository.
Docker and Takeoff​
Introducing Docker and Takeoff​
As we discussed in our previous post, Docker is a software package that helps us build, run, and manage containers. It simplifies the process of bundling up processes and running them in a way that's isolated from the rest of the system.
Takeoff is a Docker image that comes pre-packaged with large language model (LLM) compression and optimization technology, combined with a low-latency inference server. It is designed to allow you to quickly and easily deploy a large language model in a container.
Getting Started with Takeoff​
To run Takeoff locally, you will need to install Docker. If you have a GPU on your machine, you should also install the NVIDIA Container Toolkit, which allows containers running on your machine to use your GPU. Once Docker is set up, you can run Takeoff with the following command:
CPU only:
docker run \
  -e TAKEOFF_MODEL_NAME='google/flan-t5-small' \
  -p 8080:80 \
  tytn/fabulinus:latest-cpu
GPU enabled:
docker run \
  --gpus all \
  -e TAKEOFF_MODEL_NAME='google/flan-t5-small' \
  -p 8080:80 \
  tytn/fabulinus:latest
In the above command, we set the TAKEOFF_MODEL_NAME environment variable to the name of the model we want to deploy. This instructs Takeoff to download and optimize this model when the container starts. The -p 8080:80 flag maps port 80 inside the container to port 8080 on the host machine, allowing us to access the inference server.
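For longer sessions it can be convenient to run the container in the background and follow its logs while the model downloads. The sketch below wraps the standard docker run -d, docker logs -f and docker rm -f commands in small shell functions. The container name takeoff-local and the function names are assumptions made for this example, not part of Takeoff.

```shell
# Hypothetical container name used only in this sketch.
TAKEOFF_CONTAINER="${TAKEOFF_CONTAINER:-takeoff-local}"

# Start the CPU image detached; the first argument optionally overrides the model.
start_takeoff() {
  docker run -d \
    --name "$TAKEOFF_CONTAINER" \
    -e TAKEOFF_MODEL_NAME="${1:-google/flan-t5-small}" \
    -p 8080:80 \
    tytn/fabulinus:latest-cpu
}

# Stream the container logs while the model is downloaded and optimized.
follow_takeoff_logs() {
  docker logs -f "$TAKEOFF_CONTAINER"
}

# Stop and remove the container once you are done.
stop_takeoff() {
  docker rm -f "$TAKEOFF_CONTAINER"
}
```

Run start_takeoff, then follow_takeoff_logs in the same shell, and stop_takeoff to clean up afterwards.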
Calling the server​
Once the container has booted, call the server with curl as follows:
curl -X POST \
  -H "Content-Type: application/json" \
  -N \
  -d '{"text": "Hello world!"}' \
  http://localhost:8080/generate_stream
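To send arbitrary prompts without hand-editing the JSON, the request can be wrapped in a small function. This is a minimal sketch: the function names and the TAKEOFF_URL variable are assumptions for illustration, the escaping only handles backslashes and double quotes (not newlines or other control characters), and the endpoint and payload shape are the ones used in the curl call above.

```shell
# Base URL of the server; defaults to the port mapping used in this post.
TAKEOFF_URL="${TAKEOFF_URL:-http://localhost:8080}"

# Build the JSON body, escaping backslashes and double quotes in the prompt.
takeoff_payload() {
  escaped=$(printf '%s' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g')
  printf '{"text": "%s"}' "$escaped"
}

# POST the prompt and stream the response (-N disables curl's output buffering).
ask_takeoff() {
  curl -sS -N -X POST \
    -H "Content-Type: application/json" \
    -d "$(takeoff_payload "$1")" \
    "$TAKEOFF_URL/generate_stream"
}
```

For example, ask_takeoff 'Say "hello" to the world' sends a prompt that contains quotes without breaking the JSON body.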
Running Takeoff directly with Docker is somewhat unnecessary, since the iris launcher comes packaged with all the tools required to launch and manage these Docker containers. Understanding how to use the container directly becomes useful when you want to deploy it more robustly. That is the subject of the next section, where we discuss Kubernetes.
Deploying Takeoff with Kubernetes​
Setting Up Kubernetes with MicroK8s​
Kubernetes is a container orchestrator: it manages the deployment and scaling of containers across a cluster of machines. Those machines can have heterogeneous resources: GPUs, CPUs, RAM, etc. Kubernetes makes horizontal and vertical scaling of containers easy, and allows us to deploy containers in a way that is resilient to failure. Kubernetes comes with autoscaling built in, so we can scale our deployment up and down automatically based on demand. It also provides a number of other useful features, such as load balancing and service discovery.
In order to try Kubernetes out locally, we need to set up a Kubernetes cluster on our machine. In production, we'd recommend a managed Kubernetes service such as Google Kubernetes Engine or Amazon Elastic Kubernetes Service. Such a solution would give us a cluster of machines with Kubernetes pre-installed, and would allow us to easily scale our cluster up and down as needed.
Developing locally for such a deployment with Kubernetes can be a pain. One of the easiest ways to do this locally is using MicroK8s, a lightweight Kubernetes distribution designed for local deployment and testing. You can find instructions for installing MicroK8s here.
There are a number of other options for running Kubernetes locally, including Minikube, Docker Desktop and kind. We're using MicroK8s here, since it's the most painless option for enabling GPU support on a local Kubernetes cluster.
GPU support in MicroK8s is bundled in an addon called gpu, which you can install with the following command:
microk8s enable gpu
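The addon may take a few minutes to finish setting up. One way to check that a node is advertising its GPU is to read the nvidia.com/gpu entry (the resource name used in the GPU deployment spec below) from each node's allocatable resources. The helper names here are assumptions for this sketch.

```shell
# Print the number of allocatable GPUs reported by each node (space separated).
node_gpu_counts() {
  microk8s kubectl get nodes \
    -o jsonpath='{.items[*].status.allocatable.nvidia\.com/gpu}'
}

# Succeed if at least one node reports a GPU count greater than zero.
cluster_has_gpu() {
  for count in $(node_gpu_counts); do
    [ "$count" -gt 0 ] 2>/dev/null && return 0
  done
  return 1
}
```

A typical use would be cluster_has_gpu && echo "GPU available" || echo "No GPU advertised yet".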
Kubernetes Deployment Spec​
Once MicroK8s is installed, we can write our Kubernetes deployment specification. This is a YAML file that tells Kubernetes what we want to deploy, and how we want it to be managed. Kubernetes has a number of different resource types; for an overview of resource types and how they interact with one another, see our primer.
Here's a simple example for our Takeoff container:
CPU only (deployment-cpu.yaml):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: takeoff-engine
spec:
  replicas: 1
  selector:
    matchLabels:
      app: takeoff-engine
  template:
    metadata:
      labels:
        app: takeoff-engine
    spec:
      containers:
        - name: takeoff-engine
          image: tytn/fabulinus:latest-cpu
          env:
            - name: TAKEOFF_MODEL_NAME
              value: "google/flan-t5-small"
          ports:
            - containerPort: 80
GPU enabled (deployment-gpu.yaml):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: takeoff-engine
spec:
  replicas: 1
  selector:
    matchLabels:
      app: takeoff-engine
  template:
    metadata:
      labels:
        app: takeoff-engine
    spec:
      containers:
        - name: takeoff-engine
          image: tytn/fabulinus:latest
          env:
            - name: TAKEOFF_MODEL_NAME
              value: "google/flan-t5-small"
          resources:
            limits:
              nvidia.com/gpu: 1
          ports:
            - containerPort: 80
This deployment spec tells Kubernetes to create a deployment named 'takeoff-engine', with one replica (i.e., one pod).
Each pod will run our Takeoff image.
If we have a GPU available, we specify in the resources section that the pod should have access to one GPU.
If we don't, we use the -cpu image, which speeds up the container download (since it doesn't have to package the GPU drivers).
The spec also sets the TAKEOFF_MODEL_NAME environment variable, just like in our Docker command earlier.
Create the file, and then apply it to your Kubernetes cluster using the following command:
CPU only:
kubectl apply -f deployment-cpu.yaml
GPU enabled:
kubectl apply -f deployment-gpu.yaml
If you run kubectl get pods, you should see a pod named takeoff-engine-... starting up on your cluster.
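The first start can take a while, because the pod has to pull the image before the model is downloaded and optimized. Rather than polling kubectl get pods by hand, kubectl rollout status blocks until the deployment's pods are ready or a timeout expires. The function name and the ten-minute default are assumptions for this sketch. Since the spec above defines no readiness probe, "ready" here only means the container has started, not that the model has finished loading.

```shell
# Block until the takeoff-engine deployment has rolled out,
# or give up after the timeout given as the first argument.
wait_for_takeoff() {
  kubectl rollout status deployment/takeoff-engine --timeout="${1:-10m}"
}
```

Calling wait_for_takeoff 30m allows more time for a large image pull on a slow connection.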
Useful Commands
- To get the logs for the pod, run microk8s kubectl logs takeoff-engine-...
- To delete the pod, run microk8s kubectl delete pod takeoff-engine-... (though, since we used a deployment, the deployment controller will immediately bring up a replacement).
- To delete the deployment, run microk8s kubectl delete deployment takeoff-engine
- To get information about the pod, run microk8s kubectl describe pod takeoff-engine-...
If you're not using MicroK8s, or have installed and authenticated the kubectl command line tool, you should omit the microk8s prefix from the above commands.
Talking to Takeoff: Creating a Service​
Services give a set of pods a consistent network identity. This is a stable endpoint that forwards traffic to one or more pods. Here's a simple example of a service for our deployment:
apiVersion: v1
kind: Service
metadata:
  name: takeoff-engine-service
spec:
  selector:
    app: takeoff-engine
  ports:
    - protocol: TCP
      port: 8080
      targetPort: 80
  type: ClusterIP
This service forwards traffic from port 8080 on the service to port 80 on any pods that match the app: takeoff-engine label.
The service will distribute requests that it receives among all the matching pods, providing a simple form of load balancing.
This means that we can trivially horizontally scale our deployment by increasing the number of replicas in our deployment spec, while keeping the service the same.
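For quick experiments, the replica count can also be changed imperatively with kubectl scale, without editing the file. This is a sketch with an assumed helper name. Note that the next kubectl apply of the spec will reset the count to whatever the file says, and that with the GPU spec each replica requests its own GPU.

```shell
# Scale the deployment to the given number of replicas, then list its pods.
scale_takeoff() {
  # Reject anything that is not a non-negative integer.
  case "$1" in
    ''|*[!0-9]*) echo "usage: scale_takeoff <replicas>" >&2; return 1 ;;
  esac
  kubectl scale deployment takeoff-engine --replicas="$1" &&
    kubectl get pods -l app=takeoff-engine
}
```

The service needs no changes: it selects pods by the app: takeoff-engine label, so new replicas are picked up as they become ready.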
Now that we have a service deployed, we can send requests to our Takeoff container.[1] Try it out now by running the following command:
kubectl port-forward service/takeoff-engine-service 8080:8080
This will forward traffic from port 8080 on your local machine to port 8080 on the service. Then, run the following command in a separate terminal:
curl -X POST \
  -H "Content-Type: application/json" \
  -N \
  -d '{"text": "Hello world!"}' \
  http://localhost:8080/generate_stream
You should see your model's answer to the prompt "Hello world!" stream back to your terminal.
Rolling Out a New Version​
To roll out a new version of our model, we simply change the TAKEOFF_MODEL_NAME environment variable in our deployment spec, and apply the updated spec using:
CPU only:
kubectl apply -f deployment-cpu.yaml
GPU enabled:
kubectl apply -f deployment-gpu.yaml
Kubernetes will then automatically update our pods to reflect the change.
For example, if we wanted to change the model to 'tiiuae/falcon-7b-instruct', we would change the value field under TAKEOFF_MODEL_NAME to this new model name, save our deployment spec, and apply it.
The same applies to any other changes we want to make to our deployment spec.
To scale up the number of replicas, we simply change the replicas field in our deployment spec, and apply it again.
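Because a model change is an ordinary rolling update, the usual rollout tooling applies. If a new model fails to start, the previous revision can be restored. A sketch, with assumed helper names:

```shell
# Show the revisions Kubernetes has recorded for the deployment.
takeoff_history() {
  kubectl rollout history deployment/takeoff-engine
}

# Roll back to the previous revision and wait for the rollback to complete.
rollback_takeoff() {
  kubectl rollout undo deployment/takeoff-engine &&
    kubectl rollout status deployment/takeoff-engine
}
```

Rolling back changes the live deployment but not your YAML file, so revert the file as well, or the next kubectl apply will reintroduce the change.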
Packaging it up: Helm Charts (Optional)​
We've now got a working deployment of our Takeoff container, but we can make it even easier to deploy by packaging it up as a Helm chart. Helm is a package manager for Kubernetes that allows us to define a set of Kubernetes resources and then install them with a single command. This is especially useful for deploying complex applications that require multiple Kubernetes resources to be deployed together. We can also use its templating abilities to make our chart configurable, so that we can easily deploy multiple instances of our Takeoff container with different models or different numbers of replicas.
Let's start by creating a new directory for our chart:
mkdir takeoff-engine-chart
cd takeoff-engine-chart
Now, create a file called Chart.yaml in this directory, and add the following contents:
apiVersion: v2
name: takeoff-engine
description: A Helm chart for deploying Takeoff
type: application
version: 0.1.0
appVersion: 0.1.0
This file contains some basic metadata about our chart, including its name, description, and version.
Next, we need to define the Kubernetes resources that make up our chart.
Create a new directory called templates, and create deployment.yaml and service.yaml in it. Then create values.yaml in the chart's root directory, next to Chart.yaml. Add the following contents to each file:
templates/deployment.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: takeoff-engine
spec:
  replicas: 1
  selector:
    matchLabels:
      app: takeoff-engine
  template:
    metadata:
      labels:
        app: takeoff-engine
    spec:
      containers:
        - name: takeoff-engine
          image: {{ if .Values.gpu }}tytn/fabulinus:latest{{ else }}tytn/fabulinus:latest-cpu{{ end }}
          ports:
            - containerPort: 80
          env:
            - name: TAKEOFF_MODEL_NAME
              value: {{ .Values.modelName }}
          {{- if .Values.gpu }}
          resources:
            limits:
              nvidia.com/gpu: 1
          {{- end }}
templates/service.yaml:
apiVersion: v1
kind: Service
metadata:
  name: takeoff-engine-service
spec:
  selector:
    app: takeoff-engine
  ports:
    - protocol: TCP
      port: 8080
      targetPort: 80
  type: ClusterIP
values.yaml:
modelName: tiiuae/falcon-7b-instruct
gpu: false
Helm lets us separate our configuration from our Kubernetes resources, by using a templating language called Go Templates.
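Before installing, it can help to check what the templates render to. helm lint checks the chart for common problems, and helm template renders the manifests locally without touching the cluster. The sketch assumes it is run from the chart directory; the function name is made up for this example.

```shell
# Lint the chart, then print the rendered manifests for the given settings.
# $1: model name, $2: "true" to render the GPU variant (defaults to "false").
render_takeoff_chart() {
  helm lint . &&
    helm template takeoff-engine . \
      --set modelName="$1" \
      --set gpu="${2:-false}"
}
```

Running render_takeoff_chart google/flan-t5-small true should render the deployment with the tytn/fabulinus:latest image and the nvidia.com/gpu limit, which makes it easy to confirm the conditionals behave as intended.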
Installing the Chart​
Let's start by cleaning up our existing deployment:
CPU only:
kubectl delete -f deployment-cpu.yaml
kubectl delete -f service.yaml
GPU enabled:
kubectl delete -f deployment-gpu.yaml
kubectl delete -f service.yaml
Now, to redeploy with our new chart, run the following command:
helm install takeoff-engine .
To install the chart with a different model, we can override the modelName by setting the value on the command line:
helm install takeoff-engine . --set modelName=google/flan-t5-small
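helm install fails if a release with the same name already exists, so to change the model of a running release, helm upgrade is the command to reach for. With --install it also covers the first deployment. A sketch with an assumed helper name, run from the chart directory:

```shell
# Install the chart if the release does not exist yet, otherwise upgrade it in place.
# $1: model name, $2: "true" for the GPU image (defaults to "false").
deploy_takeoff_chart() {
  helm upgrade --install takeoff-engine . \
    --set modelName="$1" \
    --set gpu="${2:-false}"
}
```

For example, deploy_takeoff_chart google/flan-t5-small deploys the CPU variant, and calling it again with a different model name rolls the release over to that model.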
Adding persistence: PVCs (Optional)​
To make sure that models persist across container restarts, we can use persistent volumes. Modify the Helm chart to add a PersistentVolumeClaim (PVC) for the model cache, replacing templates/deployment.yaml with a StatefulSet in templates/statefulset.yaml. Persistence is made configurable in values.yaml:
templates/statefulset.yaml:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: takeoff-engine
spec:
  replicas: 1
  selector:
    matchLabels:
      app: takeoff-engine
  serviceName: takeoff-engine
  template:
    metadata:
      labels:
        app: takeoff-engine
    spec:
      containers:
        - name: takeoff-engine
          image: {{ if .Values.gpu }}tytn/fabulinus:latest{{ else }}tytn/fabulinus:latest-cpu{{ end }}
          ports:
            - containerPort: 80
          env:
            - name: TAKEOFF_MODEL_NAME
              value: {{ .Values.modelName }}
          {{- if .Values.gpu }}
          resources:
            limits:
              nvidia.com/gpu: 1
          {{- end }}
          {{- if .Values.persistence }}
          volumeMounts:
            - name: model-cache
              mountPath: /.takeoff_cache
          {{- end }}
  {{- if .Values.persistence }}
  volumeClaimTemplates:
    - metadata:
        name: model-cache
      spec:
        accessModes: [ "ReadWriteOnce" ]
        resources:
          requests:
            storage: {{ .Values.cacheSize }}
  {{- end }}
templates/service.yaml:
apiVersion: v1
kind: Service
metadata:
  name: takeoff-engine-service
spec:
  selector:
    app: takeoff-engine
  ports:
    - protocol: TCP
      port: 8080
      targetPort: 80
  type: ClusterIP
values.yaml:
modelName: tiiuae/falcon-7b-instruct
gpu: false
persistence: false
cacheSize: 40Gi
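With persistence enabled, the StatefulSet creates one PersistentVolumeClaim per replica from the volume claim template. Kubernetes names these claims <template name>-<StatefulSet name>-<ordinal>, so the first one here would be model-cache-takeoff-engine-0. The sketch below turns persistence on and lists the claims. The helper name is an assumption, and it presumes the cluster has a default storage class (on MicroK8s this typically means enabling a storage addon first).

```shell
# Deploy the chart with the model cache persisted, then list the resulting claims.
# $1: model name, $2: cache size (defaults to the chart's 40Gi).
deploy_takeoff_persistent() {
  helm upgrade --install takeoff-engine . \
    --set modelName="$1" \
    --set persistence=true \
    --set cacheSize="${2:-40Gi}" &&
    kubectl get pvc
}
```

Claims created from a volume claim template are not deleted when the release is uninstalled, so the cached model survives a reinstall. Remove the claim explicitly with kubectl delete pvc when the cache is no longer needed.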
Conclusions​
And that's it! We now have a deployed Takeoff container, managed by Kubernetes, ready to handle inference requests for our model. Using Docker and Kubernetes, we've made the complex task of deploying a large language model manageable, repeatable, and scalable. We're excited to see what you build with Takeoff!
If you have any questions, comments, or feedback on the Takeoff server, please reach out to us on our Discord server. For help with LLM deployment in general, or to sign up for the pro version of the Takeoff Inference Server, with features like automatic batching, multi-GPU inference, monitoring, authorization, and more, please reach out at hello@titanml.co.
Footnotes​
1. We could have sent requests directly to the pods, but the service provides a natural interface for load balancing and scaling.