Best practices for serving Takeoff in production: How to set up a reverse proxy
Serving large scale systems in production is hard. To do so effectively, efficiently and safely, users must solve many problems:
- Authentication (verifying who a user is)
- Authorisation (verifying what a user has access to)
- Appropriate rate limiting/DoS protection
- TLS support
The Takeoff server handles all the technical operations to make sure that LLMs run as efficiently and ergonomically as possible. However, setting up a production system with Takeoff requires a few more steps.
In this article, we'll talk about how to use a reverse proxy to handle rate-limiting, TLS, and authorisation cleanly and efficiently.
What is a reverse proxy? And why do I need one?
Let's start by defining some terms:
- A proxy server is a go‑between or intermediary server that forwards requests for content from multiple clients to different servers across the Internet.
- A reverse proxy server is a type of proxy server that typically sits behind the firewall in a private network and directs client requests to the appropriate backend server.
The reverse proxy provides an additional level of abstraction and control to ensure the smooth flow of network traffic between clients and servers. It can be used in myriad ways; we will walk specifically through how to implement authorisation/authentication, rate limiting, and TLS support. It can also be used to load balance traffic across multiple servers and cache static content, but these are beyond the scope of this page.
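To build intuition for the forwarding behaviour just described, here is a minimal sketch of a reverse proxy in Python using only the standard library. The ports and response body are illustrative, and this stands in for what Caddy does far more robustly (TLS, streaming, retries, health checks, and so on):

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen

# Hypothetical ports, mirroring the Caddy examples later in this article
BACKEND_PORT = 3000
PROXY_PORT = 2020

class Backend(BaseHTTPRequestHandler):
    """Stands in for the real application server (e.g. Takeoff)."""
    def do_GET(self):
        body = b"hello from backend"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
    def log_message(self, *args):  # silence per-request logging
        pass

class Proxy(BaseHTTPRequestHandler):
    """Accepts the client's request and forwards it to the backend."""
    def do_GET(self):
        with urlopen(f"http://localhost:{BACKEND_PORT}{self.path}") as resp:
            body = resp.read()
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
    def log_message(self, *args):
        pass

def serve(handler, port):
    # Bind and start serving on a daemon thread
    server = HTTPServer(("localhost", port), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server

serve(Backend, BACKEND_PORT)
serve(Proxy, PROXY_PORT)

# The client only ever talks to the proxy; the backend stays private
with urlopen(f"http://localhost:{PROXY_PORT}/") as resp:
    print(resp.read().decode())  # hello from backend
```

The client never needs to know where (or on how many machines) the backend actually runs, which is exactly the level of indirection that makes TLS termination, authentication, and rate limiting possible at the proxy layer.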
Here are some useful links if you would like to setup other resources with a reverse proxy:
- Load Balancing: Caddy load balancing docs
- Authentication: oauth2-proxy
Tools such as caddy are useful to smoothly deploy reverse proxies which handle TLS naturally (traefik and nginx are also good alternatives). These also let you configure things like authorisation/authentication and rate limits before your requests hit the inference server.
Deploying Reverse Proxy with Caddy
Now that we understand what a reverse proxy is and why we might need one, we'll run through how to deploy one using Caddy. This section draws together information available in Caddy's official docs. Caddy provides a variety of methods to deploy a reverse proxy, the simplest of which is to run the following command in your terminal (ensuring you have caddy on your PATH):
caddy reverse-proxy --from :2020 --to :3000
This command creates a reverse proxy that listens on port 2020 and forwards requests to port 3000. You can also use a Caddyfile to configure your reverse proxy. Here is the equivalent Caddyfile for the above command:
localhost:2020
reverse_proxy localhost:3000
Then you just need to run caddy run in the same directory as your Caddyfile and you're good to go.
HTTPS/TLS Termination
HTTP is insecure. TLS (Transport Layer Security) is the standard for secure communication over the internet: it protects you from "man-in-the-middle" attacks and ensures that your data is encrypted in transit. TLS is essential for any web service deployed over the internet, so that you can safeguard user data, let clients authenticate your servers, maintain regulatory compliance, improve your search engine visibility, and ensure that your users can trust you.
For more information about how TLS works, see https://howhttps.works/why-do-we-need-https/.
If you own a domain you want your users to hit to access the Takeoff Inference Server, you can use Caddy to automatically provision a TLS certificate for your domain. Caddy will attempt to get a publicly-trusted certificate; make sure your DNS records point to your machine and that ports 80 and 443 are open to the public and directed toward Caddy. Here is an example Caddyfile that listens on port 443 and forwards requests to port 3000:
example.com
reverse_proxy localhost:3000
If you don't specify a port, Caddy defaults to 443 for HTTPS. In that case you will also need permission to bind to low ports. A couple of ways to do this on Linux:
- Run Caddy as root (sudo -E).
- Run sudo setcap cap_net_bind_service=+ep $(which caddy) to give Caddy this specific capability.
Basic Authentication
There are a number of authentication patterns that you might choose to set up. Starting with the simplest, you can use basic authentication to protect your Takeoff Inference Server. Here is an example Caddyfile that listens on port 443 and forwards requests to port 3000 with basic authentication (the hashed_password placeholder should be a bcrypt hash, which you can generate with caddy hash-password):
example.com {
basicauth {
user hashed_password
}
reverse_proxy localhost:3000
}
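To see what basic authentication actually does on the wire, here is a short standard-library sketch: the client base64-encodes the user:password pair and sends it in the Authorization header, which the proxy then checks against its stored hash. The credentials below are made up for illustration:

```python
import base64

def basic_auth_header(user: str, password: str) -> str:
    # The client concatenates "user:password" and base64-encodes it;
    # note this is encoding, not encryption, which is why basic auth
    # must only ever be used over TLS.
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return f"Basic {token}"

# Hypothetical credentials for illustration only
print(basic_auth_header("user", "secret"))  # Basic dXNlcjpzZWNyZXQ=
```

Because the credentials are trivially decodable, the TLS termination from the previous section is a prerequisite, not an optional extra, when using this scheme.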
JWT Authorization
Basic authentication is useful in development, but for production systems with many users, you'll need something more robust. One good method for managing user permissions is JSON Web Tokens (JWTs): they're stateless, secure, and can carry user permissions. You can use them to provide Role-Based Access Control, or even finer-grained permissions with Attribute-Based Access Control. JWTs also improve on basic authentication because they let you delegate authentication to a separate, more secure service. To read more about why basic authentication is generally advised against in production, and which other authentication services are better, this article will be of help. Delegating authentication away from your backend applications is beneficial because they can then work under the assumption that all users are valid, and only need to make access decisions. For more information about JWTs, see jwt.io or the auth0 docs.
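Before wiring this up, it can help to see what a JWT actually is: three base64url-encoded segments (header, claims, signature) joined by dots. The sketch below signs and verifies an HS256 token using only the Python standard library; the key and claims are made up for illustration, and in production you would use a maintained library such as PyJWT rather than rolling your own:

```python
import base64
import hashlib
import hmac
import json
import time

def b64url(data: bytes) -> str:
    # JWTs use unpadded URL-safe base64
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def sign_jwt(claims: dict, key: bytes) -> str:
    header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = b64url(json.dumps(claims).encode())
    sig = hmac.new(key, f"{header}.{payload}".encode(), hashlib.sha256).digest()
    return f"{header}.{payload}.{b64url(sig)}"

def verify_jwt(token: str, key: bytes) -> dict:
    header, payload, sig = token.split(".")
    expected = hmac.new(key, f"{header}.{payload}".encode(), hashlib.sha256).digest()
    if not hmac.compare_digest(b64url(expected), sig):
        raise ValueError("invalid signature")
    claims = json.loads(base64.urlsafe_b64decode(payload + "=" * (-len(payload) % 4)))
    if claims.get("exp", 0) < time.time():
        raise ValueError("token expired")
    return claims

# Hypothetical key and claims for illustration only
key = b"example-signing-key"
token = sign_jwt({"sub": "user-123", "iss": "https://api.example.com",
                  "exp": int(time.time()) + 3600}, key)
print(verify_jwt(token, key)["sub"])  # user-123
```

The crucial property is that the proxy can verify the signature and claims on its own, with no call back to the authentication service, which is what makes JWTs stateless.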
Caddy has a number of plugins that can help you with this, one of which is highlighted on their official docs here. The external module caddy-jwt can be installed using the following command:
xcaddy build --with github.com/ggicci/caddy-jwt
Here is an example of a Caddyfile that listens on port 443, checks JWT authorisation, and forwards requests to port 3000:
{
order jwtauth before basicauth
}
example.com {
jwtauth {
sign_key TkZMNSowQmMjOVU2RUB0bm1DJkU3U1VONkd3SGZMbVk=
sign_alg HS256
jwk_url https://api.example.com/jwk/keys
from_query access_token token
from_header X-Api-Token
from_cookies user_session
issuer_whitelist https://api.example.com
audience_whitelist https://api.example.io https://learn.example.com
user_claims aud uid user_id username login
meta_claims "IsAdmin->is_admin" "settings.payout.paypal.enabled->is_paypal_enabled"
}
reverse_proxy localhost:3000
}
Rate Limiting
Servers exposed to the internet without rate limiting are susceptible to denial-of-service (DoS) attacks, in which malicious users send many requests to your web server in a short time to tie up resources and deny service to other users. Rate limiting caps the number of requests that can be processed in a given duration for a single user. Happily, there is a rate-limiting plugin for Caddy; you can find the official documentation here. The external module caddy-ext has the simplest interface. It can be installed, similarly to before, with xcaddy:
xcaddy build --with github.com/RussellLuo/caddy-ext/ratelimit
Here is an example of a Caddyfile that listens on port 443 and forwards requests to port 3000 with rate limiting applied:
example.com {
rate_limit [<matcher>] <key> 1r/s
reverse_proxy localhost:3000
}
Here <matcher> is a request matcher that specifies the routes you want the limit applied to, and <key> is a unique identifier for the limit, which can be one of Caddy's placeholders.
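To build intuition for what a 2r/m-style limit does, here is a minimal fixed-window rate limiter sketched in Python. The class and parameter names are made up, and the real plugin has its own implementation, but the core bookkeeping is the same: count requests per key per time window and reject once the budget is spent.

```python
import time
from collections import defaultdict

class FixedWindowLimiter:
    """Allow at most `limit` requests per `window` seconds for each key."""

    def __init__(self, limit: int, window: float):
        self.limit = limit
        self.window = window
        self.counts = defaultdict(int)  # (key, window number) -> request count

    def allow(self, key, now=None) -> bool:
        now = time.time() if now is None else now
        bucket = (key, int(now // self.window))
        if self.counts[bucket] >= self.limit:
            return False  # the proxy would respond 429 Too Many Requests
        self.counts[bucket] += 1
        return True

# 2r/m, keyed per request method (analogous to using {method} as <key>);
# fixed timestamps are passed in here only to keep the example deterministic
limiter = FixedWindowLimiter(limit=2, window=60)
print([limiter.allow("POST", now=0) for _ in range(3)])  # [True, True, False]
print(limiter.allow("GET", now=0))    # True: each key has its own budget
print(limiter.allow("POST", now=61))  # True: a new window has started
```

The choice of <key> decides what "a single user" means: keying on the client IP limits each caller, while keying on the method (as above) gives each HTTP verb its own budget.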
Putting It All Together With Docker
We've looked at how to quickly deploy a Caddy reverse proxy and enable some useful components. Now let's look at how you'd bring this all together to be used in a Docker environment with the Takeoff Inference Server. Caddy provides an official docker image which helpfully has most of the dependencies we need pre-installed.
We are going to use Docker Compose to manage our two containers: one for the reverse proxy, one for the Takeoff Inference Server. You could also build this setup with other deployment tools (such as Kubernetes). Here is an example docker-compose.yaml file that deploys Caddy with a reverse proxy listening on localhost.
- docker-compose.yaml
services:
caddy:
build:
context: .
dockerfile: Dockerfile
restart: unless-stopped
ports:
- "80:80"
- "443:443"
- "443:443/udp"
- "2019:2019"
volumes:
- ./Caddyfile:/etc/caddy/Caddyfile
- ./site:/srv
takeoff:
image: tytn/takeoff-pro:0.12.0-gpu
container_name: takeoff
ports:
- "3000:3000"
- "3001:3001"
environment:
- RUST_LOG=INFO
volumes:
- $HOME/.takeoff_cache:/code/models
- ./takeoff-config.yaml:/code/config.yaml
deploy:
replicas: 1
resources:
reservations:
devices:
- driver: nvidia
count: 1
capabilities: [gpu]
- Caddyfile
Here we are setting a rate limit per HTTP method of 2 requests per minute, and using JWT authorisation to validate our users. The JWT authorisation is just the example from the caddy-jwt docs. You can use different identifiers for the rate limit, and specify different limits and authorisation policies per route. For this simple example we have just set a global rate limit and authorisation policy for all routes with the wildcard matcher *.
In production it is important to be more specific with your rate limits and authorisation policies. For instance, you might want a different rate limit for more sensitive endpoints that actually invoke your LLM.
localhost {
route * {
rate_limit {method} 2r/m
jwtauth {
sign_key TkZMNSowQmMjOVU2RUB0bm1DJkU3U1VONkd3SGZMbVk=
sign_alg HS256
jwk_url https://api.example.com/jwk/keys
from_query access_token token
from_header X-Api-Token
from_cookies user_session
issuer_whitelist https://api.example.com
audience_whitelist https://api.example.io https://learn.example.com
user_claims aud uid user_id username login
meta_claims "IsAdmin->is_admin" "settings.payout.paypal.enabled->is_paypal_enabled"
}
}
reverse_proxy takeoff:3000
}
We are routing our reverse proxy to takeoff:3000 instead of localhost:3000 as in the previous examples. This is because we are using Docker Compose, and the Takeoff Inference Container will have its own network namespace.
- Dockerfile
# Use the official caddy builder image to include xcaddy for plugins
FROM caddy:builder AS builder
# Download the extra plugins we need for rate limiting and jwt authorisation
RUN xcaddy build \
--with github.com/RussellLuo/caddy-ext/ratelimit \
--with github.com/ggicci/caddy-jwt
# Use a two-stage build to keep the final image small, as xcaddy isn't needed at runtime
FROM caddy:latest
COPY --from=builder /usr/bin/caddy /usr/bin/caddy
- takeoff-config.yaml
takeoff:
server_config:
max_batch_size: 30
batch_duration_millis: 50
readers_config:
reader1:
model_name: TheBloke/Llama-2-7B-AWQ
device: cuda
max_batch_size: 5
max_sequence_length: 1024
consumer_group: primary
Running docker compose up --build will start the Caddy and Takeoff Inference Server containers. The Caddy container will listen on localhost and forward requests to the Takeoff Inference Server container.
You can see the reverse proxy working now by sending the following cURL requests:
# Test JWT token that will pass our authorisation, decoded looks like:
# {
# "exp": 9955892670,
# "jti": "82294a63-9660-4c62-a8a8-5a6265efcd4e",
# "sub": "3406327963516932",
# "iss": "https://api.example.com",
# "aud": ["https://api.example.io"],
# "username": "ggicci"
# }
TEST_TOKEN=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJleHAiOjk5NTU4OTI2NzAsImp0aSI6IjgyMjk0YTYzLTk2NjAtNGM2Mi1hOGE4LTVhNjI2NWVmY2Q0ZSIsInN1YiI6IjM0MDYzMjc5NjM1MTY5MzIiLCJpc3MiOiJodHRwczovL2FwaS5leGFtcGxlLmNvbSIsImF1ZCI6WyJodHRwczovL2FwaS5leGFtcGxlLmlvIl0sInVzZXJuYW1lIjoiZ2dpY2NpIn0.O8kvRO9y6xQO3AymqdFE7DDqLRBQhkntf78O9kF71F8
# Will return 401 unauthorized as the request has no JWT token
curl -X POST -H "Content-Type: application/json" -d '{"text": "Hello, world!"}' http://localhost/generate
# Will return 200 response
curl -X POST -H"Authorization: Bearer ${TEST_TOKEN}" -H "Content-Type: application/json" -d '{"text": "Hello, world!"}' http://localhost/generate
# Will return 200 response
curl -X POST -H"Authorization: Bearer ${TEST_TOKEN}" -H "Content-Type: application/json" -d '{"text": "Hello, world!"}' http://localhost/generate
# Will return 429 too many requests as you have exceeded the rate limit of 2 requests per minute we set in the Caddyfile
curl -X POST -H"Authorization: Bearer ${TEST_TOKEN}" -H "Content-Type: application/json" -d '{"text": "Hello, world!"}' http://localhost/generate
# Will return 200 response as this is a different HTTP method; as specified in the Caddyfile, the rate limit is applied per method
curl -X GET -H"Authorization: Bearer ${TEST_TOKEN}" http://localhost/status
You should now be able to deploy your own reverse proxy in front of your production web servers. You can take this basic example and tune it to your particular needs. If you have more questions, please contact your account manager or reach out to us at hello@titanml.co.