Blog | TitanML docs

Building with TitanML: Leveraging LLMs for Creative Applications

August 16, 2023 · 12 min read

Machine Learning Engineer, TitanML

As the popularity and accessibility of Large Language Models (LLMs) continue to grow among the general public, creative individuals are using them to develop a wide array of inventive applications.

In this post, we demonstrate how you can use the TitanML Platform to fine-tune your own LLMs for one such creative use-case: creating an app to detect the critical aspects of feedback and transform them into constructive, actionable and encouraging feedback.

This simple app would help people become more aware of how their words might affect others, and give them examples of how to make their feedback constructive in a way that supports healthy interpersonal relationships.

The Problem

Critical vs. Constructive Feedback

In order to give truly constructive feedback, it helps to understand the difference between constructive and critical feedback. Understanding this difference is crucial in turning feedback into a tool for learning and growth.

Critical feedback focuses on pointing out problems without offering solutions. It can often be harsh, overly negative and directed at the qualities of a person rather than their work. It is often vague and can contain sweeping generalisations, as well as exaggerations, personal attacks and accusations. In addition, such feedback is usually emotionally charged and can be demotivating and discouraging for the individual on the receiving end.

In contrast, constructive feedback is usually positive and uplifting with specific and actionable suggestions. The main aim of constructive feedback is to help the individual improve by objectively pointing out the strengths and weaknesses of the work done by the individual.

Preparation

Here we are going to walk through the steps to fine-tuning your own LLMs, which you will be able to use in your application.

Design the Model Specifications
Generate datasets with OpenAI’s API
Fine-tune models with TitanML
Deployment with Titan Takeoff Server/Triton Server

Designing Model specifications

The application we want to build would have two simple features:

Identify instances of critical/constructive feedback and explain why each is critical/constructive.
Improve the given feedback by providing a constructive version of the feedback.

While both features could be implemented easily with a single general-purpose LLM such as GPT-3 and different prompts, it may be far more efficient to use two smaller models, each geared towards one specific task.

For the first feature, we can use a sequence classification model that produces a class label such as “critical” or “constructive” for each sentence. We could even go further and split the labels into finer categories that explain why the text is critical or constructive, such as the following:

Positive Comment (constructive)
Helpful Suggestion (constructive)
Balanced Criticism (constructive)
Vague Criticism (critical)
Harsh Criticism (critical)
Sarcastic Comment (critical)
Blameful Accusation (critical)
Personal Attack (critical)
Threat (critical)

This is not meant to be an exhaustive list. There may be other categories of critical or constructive comments, or sentences that may fall into more than one category, but these categories would be more than helpful in identifying why a part of the feedback would be critical or constructive.
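As a minimal sketch of how these categories could be encoded for a classifier, the mapping below ties each fine-grained label to its coarse class and assigns it an integer id. The names `LABEL_TO_CLASS`, `LABEL_TO_ID` and `coarse_class` are illustrative assumptions, not part of the TitanML platform.

```python
# Fine-grained labels from the list above, mapped to their coarse class.
LABEL_TO_CLASS = {
    "Positive Comment": "constructive",
    "Helpful Suggestion": "constructive",
    "Balanced Criticism": "constructive",
    "Vague Criticism": "critical",
    "Harsh Criticism": "critical",
    "Sarcastic Comment": "critical",
    "Blameful Accusation": "critical",
    "Personal Attack": "critical",
    "Threat": "critical",
}

# Integer ids, since sequence classification models predict a class index.
LABEL_TO_ID = {label: i for i, label in enumerate(LABEL_TO_CLASS)}
ID_TO_LABEL = {i: label for label, i in LABEL_TO_ID.items()}


def coarse_class(label_id: int) -> str:
    """Return 'critical' or 'constructive' for a predicted label id."""
    return LABEL_TO_CLASS[ID_TO_LABEL[label_id]]
```

With this layout the model only needs to predict one of nine ids, and the app can still show the simpler critical/constructive verdict alongside the finer explanation.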

For the second feature, we can use an encoder-decoder model that would “translate” critical feedback into constructive feedback. One good model to use would be the T5 (Text to Text Transfer Transformer) model. The model would take in a string of text and produce another string of text.

Now that we’ve outlined the basic model to use, we can move on to finding datasets to fine-tune our models.

Generating datasets with OpenAI

As there are no readily available labelled datasets relevant to our tasks, we will have to generate them with OpenAI’s API. This is a technique we have seen previously.

In order to train the two separate models, we have to generate one dataset for each model.

Dataset 1 (Classification dataset)

The first dataset will be used to train the sequence classification model, which will take in a piece of text and return its corresponding label. Thus, the dataset will require two columns: sentence and label. Here we are going to use OpenAI’s text-davinci-003 model, as it is better at understanding more complex instructions and producing standardised outputs that are easier to parse.

An example of the prompt we used to generate the dataset is as follows:


Our prompt

You are an expert in providing constructive feedback and are conducting a workshop to teach people how to transform instances of negative feedback into constructive feedback. Critical feedback is usually vague, accusatory and often focuses on the negative qualities of a person without containing much details. Constructive feedback is uplifting, given with a compassionate and helpful attitude, and usually contains clear and actionable suggestions for improvement. Can you generate 10 examples of critical feedback that contains harsh criticism (this can be replaced with labels from other categories) ?

This should be the format of the json:

[
  "Your work lacks the quality to meet the requirements.",
  "You seem clueless when it comes to executing this task."
]
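Once the model returns a JSON array in this format, each sentence can be paired with the category that was requested in the prompt. The following is a sketch under that assumption; `parse_examples` is a hypothetical helper, and the call to OpenAI's API itself is left out.

```python
import json


def parse_examples(raw_response: str, label: str) -> list[dict]:
    """Turn a JSON array of sentences into (sentence, label) rows."""
    sentences = json.loads(raw_response)
    if not isinstance(sentences, list):
        raise ValueError("expected a JSON array of strings")
    rows = []
    for sentence in sentences:
        # Skip anything that is not a non-empty string.
        if isinstance(sentence, str) and sentence.strip():
            rows.append({"sentence": sentence.strip(), "label": label})
    return rows


# The two example sentences from the prompt format above.
raw = (
    '["Your work lacks the quality to meet the requirements.", '
    '"You seem clueless when it comes to executing this task."]'
)
rows = parse_examples(raw, "Harsh Criticism")
```

Validating the shape of the response before using it is worthwhile here, because generated output does not always follow the requested format.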

Dataset 2 (Translation Dataset)

The second dataset will be used to train the T5 model, which will take in text containing critical feedback and return a constructive version of the same feedback.

While generating the datasets for the first time, we discovered that most of the critical feedback we generated pertained to presentations (e.g. your presentation lacked a sense of structure, your presentation was boring etc.). This was probably because the example feedback we gave in the prompt was about presentations. Over-representing a single context in this way can lead to poor performance of the model on real-world data. To address this, we specifically requested that the examples of feedback come from different workplace contexts.

We used the following improved prompt:


Our improved prompt

You are an expert in providing constructive feedback and are conducting a workshop to teach people how to transform instances of negative feedback into constructive feedback. Negative feedback is usually vague, accusatory and often focuses on the negative qualities of a person without containing much details. Constructive feedback is uplifting, given with a compassionate and helpful attitude, and usually contains clear and actionable suggestions for improvement.

Here is an example:

Negative Feedback: “Why was your presentation so confusing? You know that not everyone thinks like you.”

Constructive Feedback: “I think your presentation was ambitious in terms of coverage but could have been structured better to help audience to follow your presentation better. Would you be able to restructure your presentation the next time?”

Can you generate 5 pairs of negative feedback and the constructive version of each feedback in a different workplace context and put it in json format?

This should be the format of the json:

[
  {
    "Context": "You are a manager at a consulting firm and you are giving feedback to a junior consultant on their report.",
    "Negative Feedback": "You are spending too much time on meaningless tasks.",
    "Constructive Feedback": "I think you are doing a great job formatting the report and designing the charts, however, it would be great if you could first focus on getting the research to a good standard first."
  },
  {
    "Context": "You are the portfolio manager of a hedge fund and you are giving feedback to an analyst on their stock pitch.",
    "Negative Feedback": "Why didn't you include the fundamentals of the company in your report?",
    "Constructive Feedback": "I liked how concise your report was in summarizing the main points, but the clients might demand a bit more research on the fundamentals of each stock. Could you include more information in your next version?"
  }
]

Dataset Preparation

We used JSON as the default output format for OpenAI as it is relatively standardised and easy to parse. We then converted the data and compiled it into a single CSV file. Afterwards, we shuffled the rows of the dataset so that the labels are distributed evenly throughout, before splitting the dataset into two files: train.csv and a validation file.
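The shuffle-and-split step can be sketched with the Python standard library alone. The 20% validation fraction, the fixed seed and the helper names are assumptions made for illustration; the original post does not state which values were used.

```python
import csv
import random


def shuffle_and_split(rows, validation_fraction=0.2, seed=0):
    """Shuffle rows reproducibly and split into (train, validation)."""
    shuffled = list(rows)  # copy, so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    n_validation = int(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]


def write_csv(path, rows, fieldnames=("sentence", "label")):
    """Write rows (a list of dicts) to a CSV file with a header line."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(fieldnames))
        writer.writeheader()
        writer.writerows(rows)
```

Calling `shuffle_and_split(rows)` and then `write_csv` once per split produces the two files; shuffling before splitting keeps each category from clustering in one of them.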

For the Translation dataset, we combined the context column and negative feedback column into a single column as an instruction.

An Example

{
  "Context": "You are a manager at a consulting firm and you are giving feedback to a junior consultant on their report",
  "Negative Feedback": "You are spending too much time on meaningless tasks"
}

becomes:

"Context: You are a manager at a consulting firm and you are giving feedback to a junior consultant on their report. Make the following feedback constructive: You are spending too much time on meaningless tasks."
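A small helper can perform this combination for every row. This is a sketch; `build_instruction` is a hypothetical name, and the template simply mirrors the example above.

```python
def _with_end_punctuation(text: str) -> str:
    """Strip whitespace and add a full stop unless the text already ends a sentence."""
    text = text.strip()
    return text if text.endswith((".", "!", "?")) else text + "."


def build_instruction(context: str, negative_feedback: str) -> str:
    """Combine the context and the negative feedback into one instruction string."""
    return (
        f"Context: {_with_end_punctuation(context)} "
        f"Make the following feedback constructive: "
        f"{_with_end_punctuation(negative_feedback)}"
    )


instruction = build_instruction(
    "You are a manager at a consulting firm and you are giving feedback "
    "to a junior consultant on their report",
    "You are spending too much time on meaningless tasks",
)
```

Keeping the template in one function means the same wording can be reused at inference time, so the model sees inputs in the form it was trained on.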

Fine-tuning models with TitanML

Now that we have generated our datasets with OpenAI, we can use TitanML to fine-tune models for our specific tasks.

First, we uploaded the classification dataset to TitanHub with the Iris CLI:

iris upload <dataset_directory_path> feedback_classification_dataset

Next, we used the command generator feature to dispatch a new job/experiment. For this fine-tuning experiment, we are using the google/electra-base-discriminator model from HuggingFace. We can also select the dataset we previously uploaded with the dropdown.

As this is a sequence classification model, we had to fill in the number of labels and text field. We also have the option to provide configurations for hyperparameter tuning.

We can then run the Iris command from the terminal to dispatch a fine-tuning experiment to our cluster. Alternatively, you may also try out the one-click dispatch feature that’s available if your models and datasets are already on HuggingFace or uploaded to Titan Hub.

After running four fine-tuning experiments with different numbers of epochs (1, 2, 3 and 4), we found that the model trained for 3 epochs gives the best results (highest accuracy, lowest loss). We will use this model in our application to classify feedback.

For the T5 model, the process is quite similar: we upload the dataset, then use the command generator to generate an Iris command to dispatch our experiment. However, when evaluating the performance of generative models, it may be better to use the inference API and judge the models by their outputs. This time, instead of focusing on hyperparameter tuning, we can try fine-tuning different variants of Google’s Flan-T5 model (Flan-T5-Small, Flan-T5-Large, Flan-T5-XL) to see which gives us the best performance/size trade-off.

For generative models, we can use the Titan Inference API to test each model by giving it an input and judging the quality of its output.

We gave the models the following inputs to be made constructive:

Inputs

Context: You are a manager giving feedback to your subordinate, who has been underperforming severely over the past few months.

Critical text: Your performance over the past few months has been absolutely disappointing. It doesn’t seem that you’ve put in any effort in improving your performance at all. I’m afraid that we will have to evaluate your position in this company if this continues.

These are the outputs of the fine-tuned models:

Outputs

Flan-T5-small: I would suggest that your subordinate acted independently as he/she has been underperforming in the past. Please let me know if this is remedied.

Flan-T5-large: I hope this criticism shows that you’ve put your efforts in. Keep up the good job of managing your performance.

Flan-T5-XL: I think that you have tried your best to improve your performance, however, it would be beneficial for you to focus on your own strengths and learn from your mistakes to help improve your performance. Would you be able to set some time aside to focus on your own strengths and improve your performance?

From these examples, we see that the Flan-T5-small and Flan-T5-large models seem confused about the task, responding to the feedback rather than transforming it into a constructive version. Only the fine-tuned Flan-T5-XL model was able to produce coherent responses and thus this is what we will use for our application.

Deploying the models

To run inference on these models on demand, we need to deploy them. There are two main ways to do this within the Titan ecosystem.

Titan Takeoff Server

For our T5-Model, one of the most convenient options is to use the Titan Takeoff Server, accessible through our Iris CLI. You can find detailed instructions on how to do this here. Apart from a built-in CLI chatbot, you will also be able to spin up a server and make API requests to it, receiving either complete responses or streaming token by token.
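As a rough sketch of what an API request to such a server might look like from Python, the snippet below builds and sends a JSON POST request with the standard library. The address `http://localhost:8000/generate` and the `text` payload field are assumptions made for illustration only; check the Takeoff documentation for the actual route and schema of your deployment.

```python
import json
import urllib.request

# Hypothetical endpoint: replace with the address and route of your own server.
TAKEOFF_URL = "http://localhost:8000/generate"


def build_request(instruction: str, url: str = TAKEOFF_URL) -> urllib.request.Request:
    """Build a JSON POST request; the 'text' field name is an assumption."""
    payload = json.dumps({"text": instruction}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(instruction: str, url: str = TAKEOFF_URL, timeout: float = 60.0) -> str:
    """Send the instruction and return the raw response body as text."""
    request = build_request(instruction, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Separating `build_request` from `generate` makes the payload easy to inspect without a running server.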

The finished product

After deploying our models, we can use the associated endpoints to run inference on them. We have built a simple frontend where users can enter their feedback and provide its context. Upon submission, each sentence is analysed for its constructiveness, and the user is given the chance to improve the feedback with the click of a button. In a few simple steps, users get a simple analysis of their feedback and an improved version of it.
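The per-sentence analysis could be sketched as follows. The regular-expression splitter is deliberately naive, and `toy_classifier` is only a placeholder standing in for a call to the deployed classification model; both are assumptions rather than the code behind the actual frontend.

```python
import re
from typing import Callable


def split_sentences(feedback: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", feedback.strip())
    return [part for part in parts if part]


def analyse_feedback(
    feedback: str, classify: Callable[[str], str]
) -> list[tuple[str, str]]:
    """Label every sentence with the classifier that is passed in."""
    return [(sentence, classify(sentence)) for sentence in split_sentences(feedback)]


def toy_classifier(sentence: str) -> str:
    # Placeholder rule, NOT the fine-tuned model: flags one keyword only.
    if "disappointing" in sentence.lower():
        return "Harsh Criticism"
    return "Positive Comment"


analysis = analyse_feedback(
    "Your performance has been disappointing. Thanks for staying late!",
    toy_classifier,
)
```

Passing the classifier in as a function keeps the splitting logic independent of how the model is served.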

Conclusion

With the TitanML Platform, you can fine-tune LLMs easily for almost every use-case, and with the Titan Takeoff Inference Server, you can deploy them in production with ease. To start applying cutting-edge ML performance and latency optimisations to your own projects and models, check out the TitanML platform! If you have any questions, comments, or feedback on the TitanML platform, please reach out to us on our Discord server. For help with LLM deployment in general, or to sign up for the pro version of the Titan Takeoff Inference Server, with features like automatic batching, multi-GPU inference, monitoring, authorization, and more, please reach out at hello@titanml.co.

LLMs in Production: Deploying the TitanML Takeoff Server on Kubernetes

August 16, 2023 · 12 min read

Fergus Finn

CTO, TitanML

Large Language Models (LLMs) are a new technology with great potential to transform the way that we build software. They generate text, answer questions, and write code. However, deploying these models remains challenging due to their size and the substantial compute resources they require. This post focuses on using two infrastructure tools, Docker and Kubernetes, to deploy Titan Takeoff, a Docker image that bundles optimization and serving technology specifically designed for LLMs. We're following on from our primer, where we give an introduction to Docker and Kubernetes and explain how they can be used to deploy machine learning models.

LLMs in Production: Docker and Kubernetes for Machine Learning

August 15, 2023 · 4 min read

Fergus Finn

CTO, TitanML

In the present age, large language models are shaping our world in ways we never anticipated. They generate text, answer questions, and are even writing code. The power they possess to revolutionize the way we live our lives is profound. However, deploying these behemoths is a challenge. They're big, they demand significant compute resources to function, and the field of MLOps, which focuses on applying DevOps practices to machine learning workflows, is complex and still being explored.

In this blog post, we're going to introduce a crucial building block of modern MLOps - the container - and dive into a popular container orchestrator called Kubernetes. Let's start our journey into this exciting world.

Inference Optimization: Why GPUs for machine learning?

August 2, 2023 · 6 min read

Fergus Finn

CTO, TitanML

NVIDIA's stock price recently hit record levels¹, on an earnings report that showed their data center sales had gone through the roof. Those datacenter units were sold to companies trying to produce AI enabled applications. But why has AI led to this rush to buy GPUs? Why Graphics Processing Units? The answer lies in their potential for parallelising machine learning workloads by dividing them up and allowing multiple operations to be undertaken simultaneously.

Large Language Models

We can answer this question by looking to language models, which can be thought of as a sophisticated tool designed to work with text. To illustrate, consider the autoregressive language models, whose primary task is to read a piece of text and predict the most fitting continuation.

For example, given the text "The quick brown fox", an autoregressive language model predicts the most fitting continuation².

In order to achieve this, the language model needs to convert the input text into lists of numbers called vectors, each of which stores information about a word. The first step of this conversion, splitting the text into units that can be mapped to vectors, is called tokenization. It is important because computers don't understand language the way humans do and can't intuitively know the meaning or sentiment of a word.

However, they are excellent at handling numbers. So, by representing words as vectors that encapsulate textual information such as semantic meaning, similarities with other words, contextual information, grammatical properties, we can feed this information into machine learning models that can then process, analyse, or even generate language.
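The idea of turning text into vectors can be illustrated with a toy sketch. It splits on whitespace and assigns each word a random vector, whereas real models use learned sub-word tokenizers and learned embeddings; the function names and the vector size of 4 are assumptions chosen for illustration.

```python
import random


def tokenize(text: str) -> list[str]:
    """Toy tokenizer: lowercase the text and split it on whitespace."""
    return text.lower().split()


def build_embeddings(vocabulary, dim=4, seed=0):
    """Give each token a random vector; real models learn these values."""
    rng = random.Random(seed)
    return {
        token: [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        for token in vocabulary
    }


tokens = tokenize("The quick brown fox")
embeddings = build_embeddings(sorted(set(tokens)))
# One vector per token, in the order the tokens appear in the text.
vectors = [embeddings[token] for token in tokens]
```

The result is a list of numeric vectors, which is the form of input that the matrix operations in the next section work on.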

What is Matrix Multiplication?

At the heart of machine learning lies an operation called matrix multiplication, which underpins many of the key operations used in machine learning. Matrix multiplication is the process of taking a grid of numbers called a matrix and using it to transform the input text vector into another vector.

This transformation turns one representation of our input text into another, rotating and skewing it in space until it looks completely different. By transforming the input text in this way (interspersed with simple nonlinear transformations), we can capture the process of generating new text from old, by viewing it as a complicated transformation in a high-dimensional space.

When it comes to the forward operation of a machine learning model, the most resource-intensive step is computing the results of matrix multiplications[2]. This is where the role of GPUs becomes pivotal. Now, it's important to understand that matrix multiplications have a unique characteristic: they're inherently parallelisable.

In a matrix multiplication, each step calculates only a single element of the output vector. Yet no single calculation depends on any other. This means that, if we have N computing units available, we could potentially compute N elements simultaneously, leading to a significant boost in the model's operational speed.
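This independence is easy to see in a plain Python sketch of a matrix-vector product, where each output element is computed from one row of the matrix and the input vector alone. The function names and the small example matrix are illustrative assumptions.

```python
def output_element(matrix, vector, i):
    """Compute element i of the product using only row i of the matrix."""
    return sum(m * v for m, v in zip(matrix[i], vector))


def matvec(matrix, vector):
    # No call depends on the result of another, so with N compute units
    # all N elements could be calculated at the same time.
    return [output_element(matrix, vector, i) for i in range(len(matrix))]


matrix = [[1, 2], [3, 4], [5, 6]]
vector = [10, 1]
result = matvec(matrix, vector)  # [12, 34, 56]
```

Here the loop runs sequentially, but the elements could be computed in any order or all at once, which is the property that GPUs exploit.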

Here's where the difference between CPUs and GPUs becomes evident. CPUs are primarily designed to execute a limited set of operations at lightning speed, making them unsuitable for such parallel tasks. GPUs, however, are specifically engineered for these extensive parallel workloads, making them indispensable in the realm of machine learning. Thus, the solution to the NVIDIA mystery.

GPU Types - Which ones to get?

Why choose NVIDIA when there are numerous GPU providers out there? The consistent preference for NVIDIA in the machine learning arena can be attributed to its software. NVIDIA's CUDA software stack stands out as the most mature and widely-adopted platform. Notably, it integrates with modern deep learning libraries like PyTorch, JAX, and TensorFlow. Programming with CUDA is straightforward, and the powerful abstraction layers built atop it make the process even more efficient.

NVIDIA manufactures two distinct types of GPUs: those designed for consumers and those tailored for data centers. The most recent and advanced consumer GPU series for deep learning is the RTX 40xx. On the other hand, NVIDIA's datacenter GPUs, which are available through cloud providers, represent a pricier yet significantly more potent option.

The A100, for example, is a previous-generation datacenter GPU that was foundational in the training and inference of Large Language Models. The latest generation, the H100, is even more powerful. If you are looking for a comprehensive analysis of which consumer GPU to invest in for machine learning development, you can read more about it here.

Why isn't my model running on my GPU?

The most common and most dreaded experience people have when working with deep learning on GPUs is the Out Of Memory (OOM) error. This occurs when the model that you're trying to work with is too large for the memory on your GPU.
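A back-of-the-envelope estimate shows why this happens: the weights alone need roughly the number of parameters multiplied by the bytes used per parameter. The sketch below assumes a hypothetical 7-billion-parameter model and a hypothetical 16 GB GPU, and it ignores activations and other runtime overhead, which require additional memory on top.

```python
# Bytes needed to store one parameter at each numeric precision.
BYTES_PER_PARAMETER = {"float32": 4, "float16": 2, "int8": 1}


def weight_memory_gb(n_parameters: int, dtype: str) -> float:
    """Memory needed for the weights alone, in gigabytes (10**9 bytes)."""
    return n_parameters * BYTES_PER_PARAMETER[dtype] / 1e9


def fits(n_parameters: int, dtype: str, gpu_memory_gb: float) -> bool:
    """Check whether the weights alone fit in the given GPU memory."""
    return weight_memory_gb(n_parameters, dtype) <= gpu_memory_gb


N_PARAMETERS = 7_000_000_000  # hypothetical model size
GPU_MEMORY_GB = 16.0          # hypothetical GPU

for dtype in BYTES_PER_PARAMETER:
    size = weight_memory_gb(N_PARAMETERS, dtype)
    verdict = "fits" if fits(N_PARAMETERS, dtype, GPU_MEMORY_GB) else "does not fit"
    print(f"{dtype}: {size:.0f} GB, {verdict}")
```

At 32-bit precision the weights of this hypothetical model need 28 GB and do not fit, while at 16-bit precision they need 14 GB and do, which is one reason storing weights at lower precision is a common optimisation.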

So what are your options when you get an OOM error? To most people, the most straightforward option is to procure a better GPU or rent one from a cloud provider, but this is often costly and unnecessary. The more sustainable alternative is to optimise your model.

This refers to the process of making your model smaller, faster, and more efficient. There are many different inference optimisation techniques that we use to bring you the best performance on our Titan Takeoff Server. This is a huge topic, and we'll be writing more about it in the future, so do stay tuned!

Conclusions

In this post, we've seen how GPUs are the best option for machine learning workloads. We've talked about what GPUs are available, and how to choose between them. Finally, we've talked about the importance of inference optimization, to make sure that your model is running as efficiently as possible.

¹ https://edition.cnn.com/2023/08/23/tech/nvidia-earnings-ai-chips/index.html
² Simplified: in practice, the generated fragments don't correspond to words, but instead to text fragments called tokens; this process is called tokenization. For an example of how words are broken down, see OpenAI's tokenization demo.

Building with TitanML: Summarise Arxiv Papers Like a Pro

August 2, 2023 · 8 min read

Hamish Hall

Machine Learning Engineer, TitanML

The pace of ML research is accelerating, and the amount of information available is growing exponentially. It's becoming increasingly difficult to keep up with the latest developments in your field, let alone the wider world of research. The TitanML platform incorporates the techniques from this fast-moving field to make it easy, fast, and efficient to build NLP applications. To help us keep up with the firehose of information, we can use NLP to summarise and answer questions about papers.

Interact with Arxiv papers: see the demo at http://54.167.108.88:8501/

LLMs in Production: Deploying the TitanML Takeoff server on Google Cloud Run

August 2, 2023 · 10 min read

Fergus Finn

CTO, TitanML

Hi! In the last post, we deployed a simple LLM endpoint with the Titan Takeoff server on an AWS EC2 instance. We compared performance between a GPU-enabled instance and a CPU-only instance, and between the Takeoff server and raw Hugging Face/PyTorch. In this post, we'll look at another cloud provider, and try out their tooling for deploying LLM endpoints. We'll use the same Takeoff server, but this time we'll deploy it using Google Cloud, specifically their Google Cloud Run service. On the way, we'll discuss a little bit about serverless, and how the Takeoff server means that we can use serverless tools to deploy LLM endpoints.

LLMs in Production: Deploying the TitanML Takeoff server on AWS EC2

August 1, 2023 · 15 min read

Fergus Finn

CTO, TitanML

Getting large language models into production quality deployments is a complicated and difficult process. At TitanML, our goal is to make this process faster, easier, and cheaper. Let's go step-by-step through the deployment process of a large language model with AWS using the AWS CLI.
