Chatbots are a cutting-edge technology that has revolutionized the way we interact with computers. They can answer questions, retrieve external information, and even solve complex coding problems. In the era of AI, chatbots have become an integral part of our daily lives. They are used across industries such as healthcare, finance, and customer service to provide instant support, answer queries, and help people make informed decisions.
In this tutorial, we will build a custom chatbot to chat with our own document. The document we will use is India's Budget 2024 speech, presented by Hon’ble Finance Minister Smt. Nirmala Sitharaman on July 23, 2024 [1]. The budget speech is long and descriptive, which makes it difficult for a common person to study and understand the content in its entirety. With a custom chatbot, however, we can absorb most of it.
Retrieval Augmented Generation (RAG) is a new paradigm in chatbot development [2]. It combines the best of both worlds: the retrieval-based chatbot, which is good at finding relevant information in a large corpus of text, and the generative chatbot, which is good at producing human-like responses. RAG combines these two approaches to create a chatbot that can retrieve relevant information and generate human-like responses grounded in it.
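Before diving into the implementation, it helps to see the whole RAG flow in a few lines of code. The snippet below is only an illustrative sketch: retrieve(), build_prompt() and generate() are hypothetical placeholders, not real library calls; we implement the concrete versions with FAISS and Llamafile later in this tutorial.
# Illustrative RAG flow; retrieve(), build_prompt() and generate() are placeholders
def rag_answer(query, index, llm):
    context_chunks = retrieve(index, query, k=10)  # retrieval: find relevant chunks
    prompt = build_prompt(context_chunks, query)   # instruction + context + query
    return generate(llm, prompt)                   # generation: LLM writes the answer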
Developing a custom chatbot has become easier after the introduction of various open-source frameworks like LangChain, Haystack, LlamaIndex, etc. These frameworks are great for prototyping, but they offer limited customization and their documentation is often thin for niche use cases [3]. They also pull in a lot of dependencies, and issues caused by dependency conflicts are often not addressed properly [4]. Thus it is better to avoid these frameworks and build a custom chatbot in plain Python with minimal dependencies.
On the other hand, using a well-trained AI model often requires an API key that sits behind a paywall with incremental costs [5]. However, this problem has been addressed by our lovely open-source community. The introduction of Llama, LLaVA, Phi, Mistral, and Mixtral has democratized the use of LLMs in your local environment [6]. But these models generally require a GPU to run, and hosting such applications on workhorse servers like AWS has become pretty scary due to recent surprise bills [7]. So in this tutorial, we will show you how you can run any open-source model using Llamafile on Google Colab's free GPU.
For the uninitiated, Llamafile is an amazing blessing for the developer community. Mozilla launched an open-source project called Llamafile which removes all the complexity of a full-stack Large Language Model (LLM) chatbot by converting an AI model into a single executable that runs anywhere [8,9,10]. It democratizes the use of LLMs in your local environment, even without a GPU. We have already implemented that in our previous tutorial. Google Colab provides a free GPU for a limited time, and in this tutorial we will show you how you can run the same chatbot using that free GPU.
The overall structure of this tutorial is as follows: download the model (in GGUF format) and the Llamafile executable, load the document, split it into overlapping chunks, create embeddings and store them in a FAISS index, build the prompt from the retrieved context, and finally run inference with Llamafile on Colab's free GPU.
We need to download the required model in GGUF format along with the Llamafile executable. GGUF is specially designed to store inference models and perform well on consumer-grade computer hardware [11].
# Download Llamafile executable and the model
$ wget -O llamafile https://github.com/Mozilla-Ocho/llamafile/releases/download/0.8.12/llamafile-0.8.12
$ wget -O mistral-7b-instruct-q5.gguf https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF/resolve/main/mistral-7b-instruct-v0.1.Q5_0.gguf
$ chmod +x llamafile
# Install the required packages
$ pip install -q docx2txt faiss-cpu requests scikit-learn
# Make sure the document is available
$ ls docs | grep budget
budget_2024.docx
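Since we will offload model layers to the GPU later, it is worth confirming that a GPU runtime is actually active in Colab (Runtime > Change runtime type). This optional check assumes the usual NVIDIA runtime of Colab's free tier:
# Optional: confirm that a GPU is visible in the Colab runtime
$ nvidia-smi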
First, create a document loader to load documents from a folder. The full loader below is not strictly necessary for this tutorial, but it is worthwhile to see how one can load multiple documents at once.
import os
import docx2txt

class SimpleDirectoryLoader(object):
    def __init__(self, directory, allowed_extensions=None):
        self.directory = directory
        self.allowed_extensions = allowed_extensions
        # Map each supported extension to its handler
        self.handlers = {
            '.docx': self.handle_docx,
        }
        self.data = []

    def load_files(self):
        for filename in os.listdir(self.directory):
            ext = os.path.splitext(filename)[-1]
            if (self.allowed_extensions is None or ext in self.allowed_extensions) and ext in self.handlers:
                self.handlers[ext](filename)
            elif self.allowed_extensions is not None and ext not in self.allowed_extensions:
                # print(f"Skipping {filename} with extension {ext}")
                pass
            else:
                print(f"No handler for {ext} files")
        return self.data

    def handle_docx(self, filename):
        text = docx2txt.process(os.path.join(self.directory, filename))
        self.data.append({'filename': filename, 'extension': '.docx', 'data': text})
# Usage
allowed_extensions = ['.docx']
budget_loader = SimpleDirectoryLoader('path/to/your/docs/directory', allowed_extensions=allowed_extensions)
budget_doc = budget_loader.load_files()
# Print first 100 chars of the document
print(budget_doc[0]['data'][:100])
A document can contain hundreds of pages, each with thousands of words. If we feed the bot all of the document contents in one shot, we are definitely going to hit the maximum input context length. Hence we need to split the document into a list of smaller chunks. We create a chunk-manager class that builds chunks of a fixed number of words with an overlap between consecutive chunks.
Note: Overlapping is not strictly necessary, but it helps the retriever return the full relevant context when an answer gets split across more than one chunk (a toy example follows the usage code below).
import re
from typing import List, Dict

class ChunkManager(object):
    def __init__(self, docs: List[Dict[str, str]]):
        self.combined_text = self._combine_texts(docs)

    def _combine_texts(self, docs: List[Dict[str, str]]) -> str:
        # Merge all documents into one string and collapse newlines
        combined_text = ' '.join(doc['data'] for doc in docs)
        return re.sub(r'\n+', ' ', combined_text)

    def chunk_by_words_with_overlap(self, chunk_size: int, overlap: int) -> List[str]:
        words = self.combined_text.split()
        chunks = []
        step = chunk_size - overlap
        if overlap < 0:
            raise ValueError("Overlap must be a non-negative integer.")
        if step <= 0:
            raise ValueError("Overlap must be smaller than chunk size.")
        # Slide a window of chunk_size words, advancing by (chunk_size - overlap)
        for i in range(0, len(words), step):
            chunk = ' '.join(words[i:i + chunk_size])
            chunks.append(chunk)
        return chunks
# Usage
chunk_size = 250
overlap = 100
chunks = ChunkManager(docs=budget_doc).chunk_by_words_with_overlap(chunk_size, overlap)
print(chunks)
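To see what the overlap actually does, here is a small illustrative run on a made-up one-sentence document (the chunk size and overlap are chosen only for the demonstration):
# Toy illustration of overlapping chunks
toy_doc = [{'filename': 'toy.docx', 'extension': '.docx', 'data': 'one two three four five six seven eight'}]
toy_chunks = ChunkManager(docs=toy_doc).chunk_by_words_with_overlap(chunk_size=5, overlap=2)
print(toy_chunks)
# ['one two three four five', 'four five six seven eight', 'seven eight']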
An embedding is a numerical representation of the data which makes it easier for algorithms to understand and analyze the relationships between different pieces of information. It converts the data into fixed-size vectors. After that, we can use a similarity measure such as cosine similarity (or a distance such as Euclidean distance) to find the relevant data.
In this tutorial, we use TF-IDF embeddings from the scikit-learn library and FAISS, a vector similarity-search library from Facebook Research (here the CPU build, faiss-cpu), to store and search the vectors [12].
Note: One can use any open-source embedding model (from Hugging Face) or API (Gemini, OpenAI, etc.); those are dense embedding models. Here we use sparse TF-IDF embeddings.
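As a quick, self-contained illustration of vector similarity (separate from the FAISS index below, which uses L2 distance), here is a minimal sketch that compares TF-IDF vectors with scikit-learn's cosine_similarity; the sentences are made up purely for the example:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Three toy sentences and a query, only for illustration
texts = ["the budget allocates funds for railways",
         "income tax slabs were revised",
         "the monsoon season arrived early"]
query = ["what changed in income tax?"]

vectorizer = TfidfVectorizer()
text_vectors = vectorizer.fit_transform(texts)  # fixed-size sparse vectors
query_vector = vectorizer.transform(query)

# Higher cosine similarity means the text is more relevant to the query
print(cosine_similarity(query_vector, text_vectors))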
import os
import faiss
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer

class FAISS(object):
    def __init__(self, max_features=768):
        self.vectorizer = TfidfVectorizer(max_features=max_features)
        self.index = None
        self.texts = []

    # Create embeddings and build the FAISS index
    def fit(self, texts):
        self.texts = texts
        embeddings = self.vectorizer.fit_transform(texts).toarray().astype('float32')
        dimension = embeddings.shape[1]
        self.index = faiss.IndexFlatL2(dimension)
        self.index.add(embeddings)
        return embeddings

    # Similarity search over the indexed chunks:
    # fetch the top-k chunks most similar to the query
    def search(self, query, k=10):
        query_embedding = self.vectorizer.transform([query]).toarray().astype('float32')
        distances, indices = self.index.search(query_embedding, k)
        similar_texts = [self.texts[idx] for idx in indices[0]]
        return similar_texts

    def save(self, dir_path, index_name='faiss'):
        os.makedirs(dir_path, exist_ok=True)
        with open(os.path.join(dir_path, f'{index_name}.pkl'), 'wb') as f:
            pickle.dump({'vectorizer': self.vectorizer, 'texts': self.texts}, f)
        faiss.write_index(self.index, os.path.join(dir_path, f'{index_name}.index'))

    def load(self, dir_path, index_name='faiss'):
        with open(os.path.join(dir_path, f'{index_name}.pkl'), 'rb') as f:
            data = pickle.load(f)
        self.vectorizer = data['vectorizer']
        self.texts = data['texts']
        self.index = faiss.read_index(os.path.join(dir_path, f'{index_name}.index'))
# Usage
faiss_index = FAISS()
faiss_index.fit(chunks)
# Save the embeddings for future use
faiss_index.save('path/to/your/embeddings/directory', index_name='budget_embeddings')
# Load the Budget Index
budget_index = FAISS()
budget_index.load('path/to/your/embeddings/directory', index_name='budget_embeddings')
We search through the embeddings using the FAISS index's built-in similarity search.
# Query the Budget Index for similar documents
query = "Who represented the Budget 2024?"
no_of_docs_to_fetch = 10 # Get top 10 similar documents
context_chunks = budget_index.search(query, k=no_of_docs_to_fetch)
print(context_chunks)
Before we proceed to the chatting part, we need to give the model an instruction on how the bot should behave. This is very important because we have seen that the quality of the response heavily depends on the framing of this prompt.
Essentially, our prompt has three parts: the instruction, the context, and the query.
prompt = """You are an expert in the analysis of Government Budgets.
The following contexts are from India's Budget 2024.
The user will ask questions about the Budget 2024 and you have to provide the answers based on the context.
Answer in a professional tone and use bullets and paragraphs to make the answer more readable.
If you can't find the answer in the context, reply with "I am sorry, I couldn't find the answer to your question.".
Don't put any external information in the answer.
Context: {context}
Question: {question}
Answer: Your answer goes here.
"""
# Construct the prompt
prompt = prompt.format(context=context_chunks, question=query)
We have to start Llamafile as a server from the terminal, but Google Colab generally does not allow opening a server port. So we run Llamafile only at inference time, using subprocess to invoke the executable directly from the notebook.
import re
import subprocess

# Backticks in the prompt would break the shell command, so strip them
def remove_backticks(input_string):
    return input_string.replace('`', '')

# Split the model output into the echoed [INST] prompt and the generated answer
def extract_answer(output):
    match = re.search(r'\[INST\](.*?)\[/INST\](.*)', output, re.DOTALL)
    if match:
        extracted_content = match.group(1).strip()
        answer = match.group(2).strip()
        return extracted_content, answer
    else:
        return output, "Error in Prompt!"

def run_llamafile_subprocess(llamafile_name, model_path, context_length, n_gpu_layers, prompt):
    re_prompt = remove_backticks(prompt)
    # -c sets the context length, -ngl the number of layers offloaded to the GPU
    command = f"./{llamafile_name} -m {model_path} -c {context_length} -ngl {n_gpu_layers} -p \"[INST]{re_prompt}[/INST]\""
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    extracted_content, answer = extract_answer(str(result.stdout))
    return extracted_content, answer
Now we can chat with the custom budget bot we have created.
# Example usage:
llamafile_name = "llamafile"
model_path = "mistral-7b-instruct-q5.gguf"
n_gpu_layers = 9999  # a large value offloads all model layers to the GPU
context_length = 4096
extracted_content, answer = run_llamafile_subprocess(llamafile_name, model_path, context_length, n_gpu_layers, prompt)
print("Question:", query)
print("Answer:", answer)
Example response:
Query: Who presented the Budget 2024?
Answer: The Budget 2024 was presented by Finance Minister Nirmala Sitharaman.