LangChain
The Takeoff API also integrates with LangChain, letting you run inference on LLMs or embed queries through the LangChain interface.
See also the official LangChain docs for the Titan Takeoff LLM and Embedding integrations.
Inferencing your LLM through LangChain
Before making calls to your LLM, make sure the Takeoff Server is up and running. To access your LLM running on the Takeoff Server, import the TitanTakeoff LLM wrapper:
Note: The TitanTakeoffPro wrapper has been deprecated, but it remains as an alias for TitanTakeoff for backwards compatibility.
from langchain_community.llms import TitanTakeoff
llm = TitanTakeoff()
output = llm.invoke("What is the weather in London in August?")
print(output)
No arguments are needed to initialize the llm object if you haven't overridden any of the default settings when you launched Takeoff. If you have, you can pass the following parameters to the TitanTakeoff object:
base_url (str, optional): The base URL where the Takeoff Inference Server is listening. Defaults to http://localhost.
port (int, optional): The port the Takeoff Inference API is listening on. Defaults to 3000.
mgmt_port (int, optional): The port the Takeoff Management API is listening on. Defaults to 3001.
streaming (bool, optional): Whether to use the generate_stream endpoint instead of generate by default, so that responses are streamed. Defaults to False. In practice the result is much the same, because the streamed response is buffered and returned like a non-streamed response; the difference is that the run manager is applied to each generated token.
models (List[ReaderConfig], optional): Any readers you'd like to spin up. Defaults to [].
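As a rough illustration of how the three address settings relate, the sketch below composes the inference and management addresses from base_url, port, and mgmt_port. The TakeoffAddress class and its methods are hypothetical and not part of the wrapper or of Takeoff; only the default values come from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TakeoffAddress:
    """Hypothetical helper for illustration; not part of LangChain or Takeoff."""

    base_url: str = "http://localhost"  # default from the parameter list
    port: int = 3000                    # inference API port
    mgmt_port: int = 3001               # management API port

    def inference_url(self) -> str:
        # Assumed form: base URL followed by the inference port.
        return f"{self.base_url.rstrip('/')}:{self.port}"

    def management_url(self) -> str:
        # Assumed form: base URL followed by the management port.
        return f"{self.base_url.rstrip('/')}:{self.mgmt_port}"


defaults = TakeoffAddress()
custom = TakeoffAddress(base_url="http://takeoff.internal", port=8000, mgmt_port=8001)
print(defaults.inference_url())
print(custom.management_url())
```

The same three values would be passed to the wrapper as keyword arguments, for example TitanTakeoff(base_url="http://takeoff.internal", port=8000, mgmt_port=8001).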
Specifying Generation Parameters
You can also specify generation parameters when making a call to the LLM. The following example demonstrates how to do this:
llm = TitanTakeoff()
# A comprehensive list of parameters can be found at https://docs.titanml.co/docs/next/apis/Takeoff_inference_REST_API/generate#request
output = llm.invoke(
    "What is the largest rainforest in the world?",
    consumer_group="primary",
    min_new_tokens=128,
    max_new_tokens=512,
    no_repeat_ngram_size=0,
    sampling_topk=1,
    sampling_topp=1.0,
    sampling_temperature=1.0,
    repetition_penalty=1.0,
    regex_string="",
    json_schema=None
)
print(output)
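If you reuse the same generation settings across many calls, one option is to keep them in a dictionary and unpack it into invoke. The sketch below is only an illustration: generation_kwargs, KNOWN_PARAMS, and SHORT_ANSWER are hypothetical names, and the set of accepted names is limited to those shown in the example above rather than the full list in the REST API reference.

```python
# Parameter names copied from the example above; the full list lives in the
# Takeoff REST API reference, so extend this set as needed.
KNOWN_PARAMS = {
    "consumer_group", "min_new_tokens", "max_new_tokens", "no_repeat_ngram_size",
    "sampling_topk", "sampling_topp", "sampling_temperature",
    "repetition_penalty", "regex_string", "json_schema",
}


def generation_kwargs(preset: dict, **overrides) -> dict:
    """Merge a preset with per-call overrides and reject misspelled names."""
    merged = {**preset, **overrides}
    unknown = sorted(set(merged) - KNOWN_PARAMS)
    if unknown:
        raise ValueError(f"Unknown generation parameters: {unknown}")
    return merged


# Hypothetical preset: top-k of 1 with a short output budget.
SHORT_ANSWER = {"sampling_topk": 1, "max_new_tokens": 64}

kwargs = generation_kwargs(SHORT_ANSWER, consumer_group="primary", max_new_tokens=128)
print(kwargs)
# With Takeoff running: llm.invoke("What is the largest rainforest in the world?", **kwargs)
```

Per-call overrides take precedence over the preset, so the call above sends max_new_tokens=128 rather than 64.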
Streaming
Streaming is also supported via the streaming flag:
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.callbacks.manager import CallbackManager
llm = TitanTakeoff(streaming=True, callback_manager=CallbackManager([StreamingStdOutCallbackHandler()]))
prompt = "What is the capital of France?"
output = llm.invoke(prompt)
print(output)
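The description of the streaming parameter notes that the streamed response is buffered and returned like a non-streamed one, while the run manager is applied per token. The following sketch imitates that behaviour in plain Python. It is not the wrapper's implementation: consume_stream is a hypothetical function, and a hard-coded token list stands in for the server's stream.

```python
from typing import Callable, Iterable, List


def consume_stream(tokens: Iterable[str], on_new_token: Callable[[str], None]) -> str:
    """Call the handler for every token, then return the buffered full text."""
    buffer: List[str] = []
    for token in tokens:
        on_new_token(token)   # per-token hook, e.g. printing to stdout
        buffer.append(token)  # tokens are also accumulated for the return value
    return "".join(buffer)


seen: List[str] = []
# A hard-coded token list stands in for the server's streamed response.
text = consume_stream(["Paris", " is", " the", " capital."], seen.append)
print(text)
```

This is why print(output) in the example above still shows the complete answer: the callback handler sees each token as it arrives, and invoke returns the joined text at the end.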
Chains
Chains can also be used with the TitanTakeoff integration:
from langchain_core.prompts import PromptTemplate
llm = TitanTakeoff()
prompt = PromptTemplate.from_template("Tell me about {topic}")
chain = prompt | llm
output = chain.invoke({"topic": "the universe"})
print(output)
Spinning up Readers/Models
You can start readers with the TitanTakeoff Python wrapper. If you didn't create any readers when first launching Takeoff, or you want to add another, you can do so when you initialize the TitanTakeoff object: pass a list of the model configs you want to start as the models parameter.
import time

# Model config for the llama model, where you can specify the following parameters:
# model_name (str): The name of the model to use
# device (str): The device to use for inference, cuda or cpu
# consumer_group (str): The consumer group to place the reader into
# tensor_parallel (Optional[int]): The number of GPUs you would like your model to be split across
# max_sequence_length (int): The maximum sequence length to use for inference, defaults to 512
# max_batch_size (int): The max batch size for continuous batching of requests
llama_model = {
    "model_name": "TheBloke/Llama-2-7b-Chat-AWQ",
    "device": "cuda",
    "consumer_group": "llama",
}
llm = TitanTakeoff(models=[llama_model])
# The model needs time to spin up; how long depends on the size of the model and your network connection speed
time.sleep(60)
prompt = "What is the capital of France?"
output = llm.invoke(prompt, consumer_group="llama")
print(output)
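A fixed sleep can be too short for a large model or needlessly long for a small one. As an alternative, here is a minimal polling sketch. wait_until_ready and fake_probe are hypothetical names, not part of the wrapper, and the probe you pass in is whatever check suits your deployment.

```python
import time
from typing import Callable


def wait_until_ready(probe: Callable[[], bool], timeout: float = 300.0, interval: float = 5.0) -> float:
    """Call probe() until it returns True; return seconds waited or raise TimeoutError."""
    start = time.monotonic()
    while True:
        try:
            if probe():
                return time.monotonic() - start
        except Exception:
            pass  # treat errors (e.g. connection refused) as "not ready yet"
        if time.monotonic() - start >= timeout:
            raise TimeoutError(f"Reader not ready after {timeout} seconds")
        time.sleep(interval)


# Stand-in probe that succeeds on the third attempt.
attempts = {"count": 0}


def fake_probe() -> bool:
    attempts["count"] += 1
    return attempts["count"] >= 3


wait_until_ready(fake_probe, timeout=2.0, interval=0.01)
print(attempts["count"])
```

With a live server, the probe could wrap a cheap llm.invoke call sent to the reader's consumer group. Whether a reader that is still loading raises an error or returns an empty response is an assumption to check against your own deployment.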
Embedding Documents/Prompts with LangChain
Similarly, before calling the embedding wrapper, ensure Takeoff is running. To access it, import the TitanTakeoffEmbed wrapper:
from langchain_community.embeddings import TitanTakeoffEmbed
embed = TitanTakeoffEmbed()
output = embed.embed_query("What is the weather in London in August?", consumer_group="embed")
print(output)
Starting Embedding Readers/Models
You can start readers with the TitanTakeoffEmbed Python wrapper. If you didn't create any readers when first launching Takeoff, or you want to add another, you can do so when you initialize the TitanTakeoffEmbed object: pass a list of the models you want to start as the models parameter.
You can use embed.embed_documents to embed multiple documents at once. It expects a list of strings, rather than the single string expected by the embed_query method.
import time

# Model config for the embedding model, where you can specify the following parameters:
# model_name (str): The name of the model to use
# device (str): The device to use for inference, cuda or cpu
# consumer_group (str): The consumer group to place the reader into
embedding_model = {
    "model_name": "BAAI/bge-large-en-v1.5",
    "device": "cpu",
    "consumer_group": "embed",
}
embed = TitanTakeoffEmbed(models=[embedding_model])
# The model needs time to spin up; how long depends on the size of the model and your network connection speed
time.sleep(60)
prompt = "What is the capital of France?"
# The reader was placed in the "embed" consumer group, so send the request to that group to reach the embedding model and not others
output = embed.embed_query(prompt, consumer_group="embed")
print(output)
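A common next step is to compare the returned vectors. The sketch below ranks documents by cosine similarity to a query. It assumes, as in the standard LangChain Embeddings interface, that each embedding is a list of floats; the helper names are hypothetical and the three-dimensional vectors are toy stand-ins for real embeddings.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_documents(query_vec: Sequence[float], doc_vecs: Sequence[Sequence[float]]) -> List[int]:
    """Return document indices sorted from most to least similar to the query."""
    scores = [cosine_similarity(query_vec, doc_vec) for doc_vec in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)


# Toy vectors stand in for the output of the embedding wrapper.
query_vec = [1.0, 0.0, 0.0]
doc_vecs = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [0.5, 0.5, 0.0]]
print(rank_documents(query_vec, doc_vecs))
```

With Takeoff running, query_vec would come from embed_query and doc_vecs from a batch embedding call, both sent to the same consumer group.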