4 posts tagged with "train"

Building Applications with TitanML: What can I build with Takeoff Server?

· 7 min read
Blake Ho
Machine Learning Engineer, TitanML

Breaking down barriers to AI adoption​

The release of various open-source Large Language Models (LLMs) this year has democratised access to AI and its associated technologies. Colossal models like Llama-2 70B or even Falcon 180B represent incredible opportunities for those who can harness their power.

While anyone can download a copy of these models, many AI enthusiasts face barriers to putting them to use. These barriers may seem daunting: not only does running inference on these models require huge amounts of compute power, deploying them is also a complicated affair.

This is why we've built the Titan Takeoff Server: to break down these barriers to AI adoption and allow everyone to deploy and tap into the power of these LLMs easily, so they can focus on building the AI-powered apps they care about.

What exactly is Titan Takeoff Server?​

In short, Titan Takeoff Server is a package that allows you to deploy LLMs and run inference on them easily and efficiently.

Simplified Deployment​

Titan Takeoff Server takes care of the difficulties of deploying and serving large language models, so you don't have to spend endless hours worrying about setting the right configurations and compatibility with your deployment environment. With a few simple commands, you'll be able to deploy your LLMs anywhere you want, be it on your local machine or in the cloud. Check out our guides showing you how to deploy them on AWS, Google Cloud and Kubernetes.

Control over Data and Models​

In an era where data privacy and proprietary models are paramount, Titan Takeoff Server stands out, allowing you to retain full ownership and control over your data, ensuring that sensitive information remains on-premises and is never exposed to third-party vulnerabilities.

Inference Optimisation​

Running inference on LLMs can be very compute-intensive, so we've developed the Titan Takeoff Server to be memory efficient, using state-of-the-art quantisation techniques to compress your LLMs. This also means that you will be able to support much larger models without upgrading your existing hardware.
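To give a feel for why quantisation saves memory, here is a minimal sketch of symmetric 8-bit quantisation in plain Python. It is a generic textbook illustration and not the specific technique used inside Titan Takeoff Server; the function names are hypothetical.

```python
def quantise_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [round(w / scale) for w in weights], scale


def dequantise(quantised, scale):
    """Recover approximate float values from the stored integers."""
    return [q * scale for q in quantised]


# Hypothetical weights, chosen only for illustration.
weights = [0.5, -1.0, 0.25, 0.0]
quantised, scale = quantise_int8(weights)
restored = dequantise(quantised, scale)
```

Each quantised value fits in one byte instead of the four bytes of a 32-bit float, at the cost of a rounding error of at most half the scale factor per weight.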

How can I build apps with Takeoff?​

Starting the Titan Takeoff Server​

To get set up with the Titan Takeoff Server, there are only two commands that you need to run. The first command installs the Iris CLI, which interfaces with the Titan Takeoff Server.

pip install titan-iris

The second and final command is the takeoff command, which optimises and loads your model before starting up a server. Note that you'll have to specify the model name as well and will be given a choice of a few optional parameters.

iris takeoff
    --model <model_name>  # Specify model name
    --device cuda         # Select CPU or GPU (cuda)
    --port 8000           # Specify port number for server
    --token <token>       # Needed for Llama-2 models

With the Titan Takeoff Server running, our model is ready to serve inference requests on demand, so it's time to start building apps. There are two main ways your app can interface with your model: through Titan Takeoff Server's own inference endpoints or through the integration with LangChain.

Inference Endpoints​

Titan Takeoff Server exposes two main inference endpoints: generate and generate_stream. If you want your response to be streamed back gradually, use the generate_stream endpoint. Otherwise, use generate, which returns the whole response in one piece once it is ready. You can also specify your desired generation parameters, such as temperature and maximum token length.
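As a rough sketch, a call to the generate endpoint from Python could look like the following, using only the standard library. The host, the /generate path, the JSON field names (text, temperature) and the helper names are assumptions made for illustration; check the Takeoff docs for the exact request schema.

```python
import json
import urllib.request


def build_generate_request(prompt, host="http://localhost:8000", temperature=None):
    """Build (url, body) for a generate call. The path and field names are assumed."""
    payload = {"text": prompt}
    if temperature is not None:
        payload["temperature"] = temperature
    return f"{host}/generate", json.dumps(payload).encode("utf-8")


def generate(prompt, **kwargs):
    """Send the request and return the raw response text (needs a running server)."""
    url, body = build_generate_request(prompt, **kwargs)
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

Keeping the request construction separate from the network call makes it easy to check the payload without a server running.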

LangChain Integration​

Titan Takeoff Server also has an integration with LangChain, allowing you to access your model through LangChain's interface. This makes it easy to access a wealth of different tools and other integrations that may be needed for downstream processing. See our docs for more on the LangChain integration.

What kind of apps can you build with Titan Takeoff Server?

During our dogfooding exercise, the TitanML team built several apps that showcase the breadth of what you can build with the Titan Takeoff Server:

  1. Chatbot with summarization capabilities that can allow you to ask questions about an Arxiv academic paper
  2. Writing tool to detect critical feedback and turn it into constructive feedback
  3. App to generate Knowledge Graphs from news articles

The possibilities are endless with what you can create with LLMs. And if you're still struggling for ideas, here are some examples to stoke your imagination:

Conversational AI Chatbots​

To power a chatbot with Titan Takeoff Server, begin by deploying a conversational model, possibly a variant of GPT or Falcon optimised for dialogue. Titan simplifies this deployment process by allowing you to load and serve the model locally. Once set up, you can integrate this server with your chat application's backend, ensuring efficient handling of user requests.

By coupling real-time processing capabilities of Titan with a user-friendly UI, you'll have a chatbot that can address user queries, engage in meaningful interactions, and provide context-aware solutions, all powered locally without the need for external APIs.

Content Creation & Enhancement​

Content creators often struggle with writer's block or need assistance in refining their drafts. Using Titan Takeoff Server, you can deploy a language model tailored for content generation or enhancement. Integrate this with platforms like CMS or blogging tools, where users can input topic prompts or existing drafts. The Titan Takeoff server can suggest content drafts, refine sentences, or even generate catchy headlines in real-time. By doing this, you offer a dynamic writing assistant that not only aids in creating content but also ensures it's engaging and well-structured, all while ensuring data remains local and private.

Educational Tutor Apps​

Modern learning experiences can be augmented with AI-powered tutors. Using Titan Takeoff Server, deploy a model trained for educational explanations. You can develop an interactive platform where students can input their questions or topics of confusion. Their queries can be sent to the Titan Takeoff Server, which then consults an educational model to produce coherent, easy-to-understand explanations. Such an app can be a boon for learners, providing them instant access to clarifications, supplementary content, and personalized learning resources, all while ensuring the data remains on-premises, preserving student privacy.

Bonus: Retrieval Augmented Generation with Vector Databases

If you have deployed an extremely large model unsuitable for fine-tuning or constantly require up to date information, you can consider implementing Retrieval Augmented Generation (RAG). RAG is a technique that combines the strengths of large pre-trained models with external knowledge databases. Instead of solely relying on the model's internal knowledge, which might be outdated or limited, RAG queries an external database in real-time to fetch relevant information or context before generating a response.

To enhance the accuracy of your results, as well as the speed of retrieval, you can even consider using a vector database such as Weaviate or Pinecone. Vector databases enable rapid, real-time semantic searches, allowing systems to retrieve information based on conceptual similarity rather than just exact matches. This ensures faster, more contextually relevant results, bridging the gap between raw data and genuine understanding.

This approach can be particularly useful for chatbots in dynamic sectors where current data is paramount, such as finance, news, or technology trends. With Titan Takeoff Server's optimized inference capabilities, incorporating RAG can lead to more informed, up-to-date, and contextually aware responses, elevating the overall user experience of your conversational AI application.
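The retrieve-then-generate flow described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions: the bag-of-words embed function stands in for a learned embedding model, the in-memory list stands in for a vector database such as Weaviate or Pinecone, and the documents and prompt layout are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: a bag-of-words count vector (a real system would use a learned model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, documents, k=1):
    """Return the k documents most similar to the query."""
    q = embed(query)
    return sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]


def build_prompt(query, documents):
    """Prepend the retrieved context to the question before sending it to the model."""
    context = "\n".join(retrieve(query, documents))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


# Hypothetical knowledge base, for illustration only.
documents = [
    "The central bank raised interest rates on Tuesday.",
    "The new phone model ships with a faster chip.",
]
prompt = build_prompt("What happened to interest rates?", documents)
```

The resulting prompt is what would be sent to the deployed model, so the answer can draw on the retrieved text rather than only on what the model memorised during training.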

Conclusion​

In all of these applications, Titan Takeoff Server acts as the local powerhouse, offering real-time, efficient, and secure model inference that can produce transformative solutions when combined with tailored models and thoughtful user experience design. We can't wait to see what you choose to build!

Building with TitanML: Leveraging LLMs for Creative Applications

· 12 min read
Blake Ho
Machine Learning Engineer, TitanML

As the popularity and accessibility of Large Language Models (LLMs) continue to grow among the general public, creative individuals are leveraging them to develop an extensive array of inventive applications.

In this post, we demonstrate how you can use the TitanML Platform to fine-tune your own LLMs for one such creative use-case: creating an app to detect the critical aspects of feedback and transform them into constructive, actionable and encouraging feedback.

This simple app would help people become more aware of how their words might affect others and give them examples of how to make their feedback constructive in a way that supports healthy interpersonal relationships.

The Problem​

Critical vs. Constructive Feedback​

In order to give truly constructive feedback, it helps to understand the difference between constructive and critical feedback. Understanding this difference is crucial in turning feedback into a tool for learning and growth.

Critical feedback focuses on pointing out problems without offering solutions. It can often be harsh, overly negative and directed at the qualities of a person rather than their work. It is often vague and can contain sweeping generalisations, as well as exaggerations, personal attacks and accusations. In addition, such feedback is usually emotionally charged and can often be demotivating and discouraging for the individual on the receiving end.

In contrast, constructive feedback is usually positive and uplifting with specific and actionable suggestions. The main aim of constructive feedback is to help the individual improve by objectively pointing out the strengths and weaknesses of the work done by the individual.

Preparation​

Here we are going to walk through the steps for fine-tuning your own LLMs, which you will be able to use in your application.

  1. Design the Model Specifications
  2. Generate datasets with OpenAI’s API
  3. Fine-tune models with TitanML
  4. Deployment with Titan Takeoff Server/Triton Server

Designing Model specifications​

The application we want to build would have two simple features:

  1. Identify instances of critical/constructive feedback and explain why each is critical/constructive.
  2. Improve the given feedback by providing a constructive version of the feedback.

While both features could be implemented easily with an overarching LLM such as GPT-3 with different prompts, it may be far more efficient to use two smaller models, each geared towards one of the tasks.

For the first feature, we can use a sequence classification model that can produce a class label such as “critical” or “constructive” for each sentence. We could even go further and split the labels into finer categories that would explain why the text is critical or constructive, such as the following:

  1. Positive Comment (constructive)
  2. Helpful Suggestion (constructive)
  3. Balanced Criticism (constructive)
  4. Vague Criticism (critical)
  5. Harsh Criticism (critical)
  6. Sarcastic Comment (critical)
  7. Blameful Accusation (critical)
  8. Personal Attack (critical)
  9. Threat (critical)

This is not meant to be an exhaustive list. There may be other categories of critical or constructive comments, or sentences that may fall into more than one category, but these categories would be more than helpful in identifying why a part of the feedback would be critical or constructive.
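One way to represent the categories above in code is a small lookup table from fine-grained label to coarse class. This is only a sketch: the variable and function names are hypothetical, and the integer ids simply follow the order of the list.

```python
# Fine-grained label -> coarse class, copied from the list above.
LABEL_TO_CLASS = {
    "Positive Comment": "constructive",
    "Helpful Suggestion": "constructive",
    "Balanced Criticism": "constructive",
    "Vague Criticism": "critical",
    "Harsh Criticism": "critical",
    "Sarcastic Comment": "critical",
    "Blameful Accusation": "critical",
    "Personal Attack": "critical",
    "Threat": "critical",
}

# Integer ids for training a sequence classifier (the ordering is an arbitrary choice).
LABEL_TO_ID = {label: i for i, label in enumerate(LABEL_TO_CLASS)}


def coarse_class(label):
    """Collapse a fine-grained label to 'critical' or 'constructive'."""
    return LABEL_TO_CLASS[label]
```

With this mapping, the classifier can be trained on the nine fine-grained labels while the app still reports the simpler critical/constructive verdict.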

For the second feature, we can use an encoder-decoder model that would “translate” critical feedback into constructive feedback. One good model to use would be the T5 (Text to Text Transfer Transformer) model. The model would take in a string of text and produce another string of text.

Now that we’ve outlined the basic model to use, we can move on to finding datasets to fine-tune our models.

Generating datasets with OpenAI​

As there are no readily available labelled datasets relevant to our tasks, we will have to generate them with OpenAI’s API. This is a technique we have seen previously.

In order to train the two separate models, we have to generate one dataset for each model.

Dataset 1 (Classification dataset)

The first dataset will be used to train the sequence classification model, which will take in a piece of text and return its corresponding label. Thus, the dataset will require two columns: sentence and label. Here we are going to use OpenAI’s text-davinci-003 model, as it is better at understanding more complex instructions and producing standardised outputs that are easier to parse.

An example of the prompt we used to generate the dataset is as follows:

Our prompt:

You are an expert in providing constructive feedback and are conducting a workshop to teach people how to transform instances of negative feedback into constructive feedback. Critical feedback is usually vague, accusatory and often focuses on the negative qualities of a person without containing much details. Constructive feedback is uplifting, given with a compassionate and helpful attitude, and usually contains clear and actionable suggestions for improvement. Can you generate 10 examples of critical feedback that contains harsh criticism (this can be replaced with labels from other categories) ?

This should be the format of the json:

[
"Your work lacks the quality to meet the requirements.",
"You seem clueless when it comes to executing this task."
]
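Swapping the category into the final question of the prompt and parsing the JSON reply can be sketched as follows. The helper names are hypothetical, the template covers only the last sentence of the prompt, and the actual call to OpenAI's API is deliberately left out.

```python
import json

# Final question of the prompt, with the category left as a placeholder.
PROMPT_TEMPLATE = (
    "Can you generate {n} examples of {kind} feedback that contains {label}?"
)


def build_request(label, kind, n=10):
    """Fill in the final question of the prompt for one category."""
    return PROMPT_TEMPLATE.format(n=n, kind=kind, label=label.lower())


def parse_examples(raw_json, label):
    """Turn the model's JSON list of sentences into (sentence, label) rows."""
    sentences = json.loads(raw_json)
    return [{"sentence": s.strip(), "label": label} for s in sentences]


# The example reply shown above, used here as stand-in model output.
raw = (
    '["Your work lacks the quality to meet the requirements.", '
    '"You seem clueless when it comes to executing this task."]'
)
rows = parse_examples(raw, "Harsh Criticism")
```

Each row already has the two columns the classification dataset needs, sentence and label.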

Dataset 2 (Translation Dataset)

The second dataset will be used to train the T5 model, which will take in a text containing critical feedback and return the constructive version of the same feedback.

While generating the datasets for the first time, we discovered that most of the critical feedback we generated pertained to presentations (e.g. your presentation lacked a sense of structure, your presentation was boring etc.). This was probably because the example feedback we gave to the prompt was about presentations. This instance of oversampling can lead to poor performance of the model with real-world data. In order to address this, we specifically requested these examples of feedback to be from a different workplace context.

We used the following improved prompt:

Our improved prompt:

You are an expert in providing constructive feedback and are conducting a workshop to teach people how to transform instances of negative feedback into constructive feedback. Negative feedback is usually vague, accusatory and often focuses on the negative qualities of a person without containing much details. Constructive feedback is uplifting, given with a compassionate and helpful attitude, and usually contains clear and actionable suggestions for improvement.

Here is an example:

Negative Feedback: “Why was your presentation so confusing? You know that not everyone thinks like you.”

Constructive Feedback: “I think your presentation was ambitious in terms of coverage but could have been structured better to help audience to follow your presentation better. Would you be able to restructure your presentation the next time?”

Can you generate 5 pairs of negative feedback and the constructive version of each feedback in a different workplace context and put it in json format?

This should be the format of the json:

[
{
"Context": "You are a manager at a consulting firm and you are giving feedback to a junior consultant on their report.",
"Negative Feedback": "You are spending too much time on meaningless tasks.",
"Constructive Feedback": "I think you are doing a great job formatting the report and designing the charts, however, it would be great if you could first focus on getting the research to a good standard first."
},
{
"Context": "You are the portfolio manager of a hedge fund and you are giving feedback to an analyst on their stock pitch.",
"Negative Feedback": "Why didn't you include the fundamentals of the company in your report?",
"Constructive Feedback": "I liked how concise your report was in summarizing the main points, but the clients might demand a bit more research on the fundamentals of each stock. Could you include more information in your next version?"
}
]

Dataset Preparation​

We used JSON as the default output format for OpenAI as it is relatively standardised and easy to parse. We then converted the data and compiled it into a single CSV file. Afterwards, we shuffled the rows of the dataset to ensure the labels were distributed evenly throughout, before splitting the dataset into two files: train.csv and a validation file.
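The shuffle-and-split step can be sketched with the standard library alone. The 80/20 split, the fixed seed and the function names are assumptions for illustration; the post does not state the proportions that were actually used.

```python
import csv
import random


def shuffle_and_split(rows, validation_fraction=0.2, seed=0):
    """Shuffle rows reproducibly, then split them into (train, validation)."""
    rows = list(rows)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(rows)
    n_validation = int(len(rows) * validation_fraction)
    return rows[n_validation:], rows[:n_validation]


def write_csv(path, rows, fieldnames=("sentence", "label")):
    """Write a list of dict rows to a CSV file with a header line."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
```

Calling write_csv once for each half of the split produces the two files that are uploaded for fine-tuning.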

For the Translation dataset, we combined the context column and negative feedback column into a single column as an instruction.
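A minimal sketch of this combination step is shown here. The function names are hypothetical; the dictionary keys match the JSON format requested in the prompt.

```python
def ensure_full_stop(text):
    """Strip whitespace and add a full stop unless the text already ends a sentence."""
    text = text.strip()
    return text if text.endswith((".", "?", "!")) else text + "."


def to_instruction(example):
    """Merge the context and the negative feedback into one instruction string."""
    context = ensure_full_stop(example["Context"])
    feedback = ensure_full_stop(example["Negative Feedback"])
    return f"Context: {context} Make the following feedback constructive: {feedback}"
```

The instruction string becomes the model input, and the corresponding constructive feedback is the target output.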

An Example
{
"Context": "You are a manager at a consulting firm and you are giving feedback to a junior consultant on their report",
"Negative Feedback": "You are spending too much time on meaningless tasks"
}

becomes:

"Context: You are a manager at a consulting firm and you are giving feedback to a junior consultant on their report. Make the following feedback constructive: You are spending too much time on meaningless tasks."

Fine-tuning models with TitanML​

Now that we have generated our datasets with OpenAI, we can use TitanML to fine-tune models for our specific tasks.

First, we uploaded the classification dataset to TitanHub with the Iris CLI:

iris upload <dataset_directory_path> feedback_classification_dataset

Next, we used the command generator feature to dispatch a new job/experiment. For this fine-tuning experiment, we are using the google/electra-base-discriminator model from HuggingFace. We can also select the dataset we previously uploaded with the dropdown.

As this is a sequence classification model, we had to fill in the number of labels and text field. We also have the option to provide configurations for hyperparameter tuning.

The TitanML Train command builder

We can then run the Iris command from the terminal to dispatch a fine-tuning experiment to our cluster. Alternatively, you may also try out the one-click dispatch feature that’s available if your models and datasets are already on HuggingFace or uploaded to Titan Hub.

After running four different fine-tuning experiments, each with a different number of epochs (1, 2, 3 and 4), we found that the model trained for 3 epochs gives the best results (highest accuracy, lowest loss). We will use this model for our application to classify feedback.

The results of the fine-tuning experiments

For the T5 model, the process is quite similar: we upload the dataset, then use the command generator to generate an Iris command to dispatch our experiment. This time, instead of focusing on hyperparameter tuning, we can try fine-tuning different variants of Google’s Flan-T5 model (Flan-T5-Small, Flan-T5-Large, Flan-T5-XL) to see which gives us the best performance/size trade-off.

For generative models, it may be better to evaluate performance by inspecting outputs directly, so we can use the Titan Inference API to test each model by giving it an input and judging the quality of its output.

We gave the models the following inputs to be made constructive:

Inputs

Context: You are a manager giving feedback to your subordinate, who has been underperforming severely over the past few months.

Critical text: Your performance over the past few months has been absolutely disappointing. It doesn’t seem that you’ve put in any effort in improving your performance at all. I’m afraid that we will have to evaluate your position in this company if this continues.

These are the outputs of the fine-tuned models:

Outputs

Flan-T5-small: I would suggest that your subordinate acted independently as he/she has been underperforming in the past. Please let me know if this is remedied.

Flan-T5-large: I hope this criticism shows that you’ve put your efforts in. Keep up the good job of managing your performance.

Flan-T5-XL: I think that you have tried your best to improve your performance, however, it would be beneficial for you to focus on your own strengths and learn from your mistakes to help improve your performance. Would you be able to set some time aside to focus on your own strengths and improve your performance?

From these examples, we see that the Flan-T5-small and Flan-T5-large models seem confused about the task, responding to the feedback rather than transforming it into a constructive version. Only the fine-tuned Flan-T5-XL model was able to produce coherent responses and thus this is what we will use for our application.

Deploying the models​

In order to run inference on these models on demand, we need to deploy them. There are two main ways to do this within the Titan ecosystem.

Titan Takeoff Server​

For our T5-Model, one of the most convenient options is to use the Titan Takeoff Server, accessible through our Iris CLI. You can find detailed instructions on how to do this here. Apart from a built-in CLI chatbot, you will also be able to spin up a server and make API requests to it, receiving either complete responses or streaming token by token.

The finished product​

After deploying our models, we can then use the associated endpoints to inference our models. We have built a simple frontend where users can input their feedback and provide the context of their feedback. Upon submission, each sentence would be analysed for its constructiveness and the user will be given a chance to improve the feedback with the click of a button. With a few simple steps, users will be able to get a simple analysis of their feedback and an improved version of their feedback.

The frontend of our application

Conclusion​

With the TitanML Platform, you can fine-tune LLMs easily for almost every use-case, and with the Titan Takeoff Inference Server, you can deploy them in production with ease. To start applying cutting-edge ML performance and latency optimisations to your own projects and models, check out the TitanML platform! If you have any questions, comments, or feedback on the TitanML platform, please reach out to us on our Discord server. For help with LLM deployment in general, or to sign up for the pro version of the Titan Takeoff Inference Server, with features like automatic batching, multi-GPU inference, monitoring, authorization, and more, please reach out at hello@titanml.co.

Inference Optimization: Why GPUs for machine learning?

· 6 min read
Fergus Finn
CTO, TitanML

NVIDIA's stock price recently hit record levels[1] on an earnings report that showed their data center sales had gone through the roof. Those data center units were sold to companies trying to produce AI-enabled applications. But why has AI led to this rush to buy GPUs? Why Graphics Processing Units? The answer lies in their potential for parallelising machine learning workloads by dividing them up and allowing multiple operations to be undertaken simultaneously.

Large Language Models​

We can answer this question by looking at language models, which can be thought of as sophisticated tools designed to work with text. To illustrate, consider autoregressive language models, whose primary task is to read a piece of text and predict the most fitting continuation.

Here is an example prompt that an autoregressive language model could be asked to continue[2]:

The quick brown fox
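The predict-and-append loop of an autoregressive model can be sketched with a toy stand-in for the model. The lookup table below is entirely hypothetical: a real language model computes a probability distribution over its whole vocabulary at every step instead of consulting a fixed table.

```python
# Hypothetical lookup table standing in for a trained model:
# it maps the most recent word to the predicted next word.
NEXT_WORD = {
    "fox": "jumps",
    "jumps": "over",
    "over": "the",
    "the": "lazy",
    "lazy": "dog",
}


def generate(prompt, max_new_words=5):
    """Repeatedly predict the next word and append it to the text."""
    words = prompt.split()
    for _ in range(max_new_words):
        next_word = NEXT_WORD.get(words[-1].lower())
        if next_word is None:  # the toy model has no prediction, so stop
            break
        words.append(next_word)
    return " ".join(words)


completion = generate("The quick brown fox")
```

Each new word is fed back in as part of the input for the next prediction, which is what "autoregressive" refers to.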

In order to achieve this, the language model first needs to convert the input text into lists of numbers called vectors, each of which stores information about a word. This conversion starts with a step called tokenization and is important because computers don't understand language the way humans do and can't intuitively know the meaning or sentiment of a word.

However, they are excellent at handling numbers. So, by representing words as vectors that encapsulate textual information such as semantic meaning, similarities with other words, contextual information, grammatical properties, we can feed this information into machine learning models that can then process, analyse, or even generate language.

What is Matrix Multiplication?​

At the heart of machine learning lies an operation called matrix multiplication, which underpins many of the key operations used in machine learning. Matrix multiplication is the process of taking a grid of numbers called a matrix and using it to transform the input text vector into another vector.

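$$A = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}, \quad B = \begin{pmatrix} 5 \\ 6 \end{pmatrix}, \quad C = AB = \begin{pmatrix} 1 \cdot 5 + 2 \cdot 6 \\ 3 \cdot 5 + 4 \cdot 6 \end{pmatrix} = \begin{pmatrix} 17 \\ 39 \end{pmatrix}$$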

This transformation turns one representation of our input text into another, rotating and skewing it in space until it looks completely different. By transforming the input text in this way (interspersed with simple nonlinear transformations), we can capture the process of generating new text from old, by viewing it as a complicated transformation in a high-dimensional space.

When it comes to the forward operation of a machine learning model, the most resource-intensive step is computing the results of matrix multiplications[2]. This is where the role of GPUs becomes pivotal. Now, it's important to understand that matrix multiplications have a unique characteristic: they're inherently parallelisable.

In the example above, each step calculates a single element of the output vector, yet no element's calculation depends on any other. This means that if we have N computing units available, we could potentially compute N elements simultaneously, leading to a significant boost in the model's operational speed.
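The independence of the output elements can be demonstrated with a small 2×2 matrix and a 2-element vector. This sketch uses Python threads purely to show that each element can be handed to a separate worker; it is not how a GPU is programmed, and Python threads will not give a real speed-up for this kind of arithmetic.

```python
from concurrent.futures import ThreadPoolExecutor

A = [[1, 2], [3, 4]]  # matrix
B = [5, 6]            # input vector


def output_element(row):
    """Compute one element of the output: the dot product of one row of A with B."""
    return sum(a * b for a, b in zip(row, B))


# One element after another, as a single computing unit would do it.
sequential = [output_element(row) for row in A]

# One worker per output element; no worker needs another worker's result.
with ThreadPoolExecutor(max_workers=len(A)) as pool:
    parallel = list(pool.map(output_element, A))
```

Both approaches give the same result, [17, 39], because the order in which the elements are computed does not matter.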

Here's where the difference between CPUs and GPUs becomes evident. CPUs are primarily designed to execute a small number of operations at a time at lightning speed, making them poorly suited to such parallel tasks. GPUs, however, are specifically engineered for these extensive parallel workloads, making them indispensable in the realm of machine learning. That is the solution to the NVIDIA mystery.

GPU Types - Which ones to get?​

Why choose NVIDIA when there are numerous GPU providers out there? The consistent preference for NVIDIA in the machine learning arena can be attributed to its software. NVIDIA's CUDA software stack stands out as the most mature and widely-adopted platform. Notably, it seamlessly integrates with modern deep learning libraries like PyTorch, JAX, and TensorFlow. Programming with CUDA is straightforward, and the powerful abstraction layers built atop it make the process even more efficient.

NVIDIA manufactures two distinct types of GPUs: those designed for consumers and those tailored for data centers. The most recent and advanced consumer GPU series for deep learning is the RTX 40xx. On the other hand, NVIDIA's datacenter GPUs, which are available through cloud providers, represent a pricier yet significantly more potent option.

The A100, for example, is a previous-generation data center GPU that was foundational in the training and inference of Large Language Models. The latest generation, the H100, is even more powerful. If you are looking for a comprehensive analysis on which consumer GPU to invest in for machine learning development, you can read more about it here.

Why isn't my model running on my GPU?​

The most common and most dreaded experience people have when working with deep learning on GPUs is the Out Of Memory (OOM) error. This occurs when the model that you're trying to work with is too large for the memory on your GPU.

So what are your options when you get an OOM error? To most people, the most straightforward option is to procure a better GPU or rent one from a cloud provider, but this is often costly and unnecessary. The more sustainable alternative is to optimise your model.

This refers to the process of making your model smaller, faster, and more efficient. There are many different inference optimisation techniques that we use to bring you the best performance on our Titan Takeoff Server. This is a huge topic, and we'll be writing more about it in the future, so do stay tuned!
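A back-of-the-envelope calculation shows why model size so often leads to OOM errors and why shrinking the weights helps. This sketch counts only the memory for the weights themselves, ignoring activations and other runtime overhead, so real usage will be higher; the function name and the choice of a 70-billion-parameter model are for illustration.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAMETER = {"float32": 4, "float16": 2, "int8": 1, "int4": 0.5}


def weight_memory_gb(n_parameters, dtype):
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_parameters * BYTES_PER_PARAMETER[dtype] / 1e9


for dtype in BYTES_PER_PARAMETER:
    print(f"70B parameters in {dtype}: {weight_memory_gb(70e9, dtype):.0f} GB")
```

At 16-bit precision the weights of a 70-billion-parameter model need roughly 140 GB, more than a single 80 GB A100 can hold, whereas at 4 bits they need about 35 GB.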

Conclusions​

In this post, we've seen how GPUs are the best option for machine learning workloads. We've talked about what GPUs are available, and how to choose between them. Finally, we've talked about the importance of inference optimization, to make sure that your model is running as efficiently as possible.

Footnotes​

  1. https://edition.cnn.com/2023/08/23/tech/nvidia-earnings-ai-chips/index.html ↩

  2. Simplified: in practice, the generated fragments don't correspond to words, but instead to text fragments called tokens; the process of splitting text into them is called tokenization. For an example of how words are broken down, see OpenAI's tokenization demo. ↩

Building with TitanML: Summarise Arxiv Papers Like a Pro

· 8 min read
Hamish Hall
Machine Learning Engineer, TitanML

The pace of ML research is accelerating, and the amount of information available is growing exponentially. It's becoming increasingly difficult to keep up with the latest developments in your field, let alone the wider world of research. The TitanML platform incorporates the techniques from this fast-moving field to make it easy, fast, and efficient to build NLP applications. To help us keep up with the firehose of information, we can use NLP to summarise and answer questions about papers.

Interact with Arxiv papers: see the demo at http://54.167.108.88:8501/