Chat with India's Budget 2024 (Part I): LangChain-Free RAG on local CPU

Abhishek Acharya
Published on: 01-Sep-2024
Last Updated on: 01-Sep-2024

1. Introduction
A chatbot is a software application designed to simulate humanlike conversations. It can answer questions, fetch external information, and even solve complex coding problems. In this tutorial we will build a custom chatbot to chat with our own document. The document we will use is India's Budget 2024 speech, presented by Hon’ble Finance Minister Smt. Nirmala Sitharaman on July 23, 2024 [1]. The budget speech is long and descriptive, so it is difficult for a common person to study and understand its content in its entirety. With a custom chatbot, we can absorb most of it by asking questions.
2. Overview: A Langchain-free approach

Developing a custom chatbot has become easier since the introduction of open-source frameworks like LangChain. These frameworks are great for prototyping, but they offer limited customization and their documentation is patchy [2]. So what's the alternative? Can we make a custom chatbot using only plain Python and fewer lines of code? Absolutely! This tutorial uses no framework and minimal dependencies.

But wait, there's more. We also need an inference model. Using a well-trained AI model often requires an API key that sits behind a paywall [3]. And if you wish to use an open-source model like Llama [4] or an application like Ollama [5], you generally need a GPU for acceptable speed. On the other hand, hosting such applications on workhorse cloud platforms like AWS has become pretty scary due to recent reports of surprise bills [6]. Is there a way out of this muddle?

Our study shows that there are ways to bypass such constraints, and the objective of this tutorial is to discuss them. We use Llamafile to demonstrate how one can use a consumer-grade laptop (a 7-year-old i5-4670s in my case) to chat with one's personal data [7]. For the uninitiated, Llamafile is a blessing for the developer community: it is an open-source project launched by Mozilla that removes much of the complexity of a full-stack Large Language Model (LLM) chatbot by packaging an AI model as a single executable that runs on most computers [8,9,10]. It democratizes the use of LLMs in your local environment without a GPU. Let's see how we can utilize this remarkable initiative.

The overall structure of this tutorial is as follows:

Fig. 1. Chatbot overview. The budget speech is split into smaller chunks. The chunks are embedded and saved in a vector store. When a user provides a query, it is embedded and passed to the vector store. A similarity search is executed, retrieving the relevant chunks. The context and query are then merged with a Q&A instruction and passed to the llamafile. The llamafile executes the AI model on the CPU and returns the answer.
3. Prerequisites
# Download a Llamafile
$ wget +o TinyLlama-1.1B-Chat-v1.0.F16.llamafile
# Install the required packages
$ pip install -q docx2txt faiss-cpu requests scikit-learn
# Make sure the document is available
$ ls docs | grep budget
budget_2024.docx

4. Building the Custom Chatbot
4.1.   Load the Document

First create a document loader to load documents from a folder. The following loader is not necessary for this tutorial but it is worthwhile to see how one can load multiple documents at once.

import os
from docx2txt import docx2txt

class SimpleDirectoryLoader(object):
    def __init__(self, directory, allowed_extensions=None):
        self.directory = directory
        self.allowed_extensions = allowed_extensions
        # Map each supported extension to the method that reads it
        self.handlers = {
            '.docx': self.handle_docx,
        }
        self.docs = []

    def load_files(self):
        for filename in os.listdir(self.directory):
            ext = os.path.splitext(filename)[-1]
            if (self.allowed_extensions is None or ext in self.allowed_extensions) and ext in self.handlers:
                self.handlers[ext](filename)
            elif self.allowed_extensions is not None and ext not in self.allowed_extensions:
                # print(f"Skipping {filename} with extension {ext}")
                pass
            else:
                print(f"No handler for {ext} files")
        return self.docs

    def handle_docx(self, filename):
        text = docx2txt.process(os.path.join(self.directory, filename))
        self.docs.append({'filename': filename, 'extension': '.docx', 'data': text})

# Usage
allowed_extensions = ['.docx']
budget_loader = SimpleDirectoryLoader('path/to/your/docs/directory', allowed_extensions=allowed_extensions)
budget_doc = budget_loader.load_files()

# Print first 100 chars of the document
print(budget_doc[0]['data'][:100])

4.2.   Chunk the Docs

A document can contain hundreds of pages, each with thousands of words. If we feed the bot all of the document's contents in one shot, we will almost certainly exceed the model's maximum input context length. Hence we need to split the document into a list of smaller chunks. We create a chunk class that builds chunks with a fixed number of words per chunk and a fixed overlap between consecutive chunks.

Note: Overlap is not strictly necessary, but it helps retrieve the relevant text when a passage is split across more than one chunk.

import re
from typing import List, Dict

class ChunkManager(object):
    def __init__(self, docs: List[Dict[str, str]]):
        self.combined_text = self._combine_texts(docs)
    def _combine_texts(self, docs: List[Dict[str, str]]) -> str:
        combined_text = ' '.join(doc['data'] for doc in docs)
        return re.sub(r'\n+', ' ', combined_text)

    def chunk_by_words_with_overlap(self, chunk_size: int, overlap: int) -> List[str]:
        words = self.combined_text.split()
        chunks = []
        step = chunk_size - overlap

        if overlap < 0:
            raise ValueError("Overlap must be a non-negative integer.")
        if step <= 0:
            raise ValueError("Overlap must be smaller than chunk size.")

        for i in range(0, len(words), step):
            chunk = ' '.join(words[i:i + chunk_size])
            chunks.append(chunk)
        return chunks

# Usage
chunk_size = 250
overlap = 100

chunks = ChunkManager(docs=budget_doc).chunk_by_words_with_overlap(chunk_size, overlap)
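
To see what the overlap does in practice, here is a minimal standalone sketch of the same sliding-window idea on a toy sentence. The function name `window_chunks` and the sample text are illustrative only and are not part of the tutorial's classes.

```python
def window_chunks(words, chunk_size, overlap):
    """Split a list of words into overlapping windows (illustrative helper)."""
    step = chunk_size - overlap
    if overlap < 0 or step <= 0:
        raise ValueError("Need 0 <= overlap < chunk_size.")
    return [' '.join(words[i:i + chunk_size]) for i in range(0, len(words), step)]

words = "the budget speech is very long and descriptive".split()
result = window_chunks(words, chunk_size=4, overlap=2)
for chunk in result:
    print(chunk)
```

Each window starts `chunk_size - overlap` words after the previous one, so neighbouring chunks share `overlap` words. The final window can be shorter than `chunk_size` and may be fully contained in the one before it.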

4.3.   Create Embeddings and save in a Vector Store

An embedding is a numerical representation of data that makes it easier for algorithms to analyze the relationships between different pieces of information. It converts each piece of data into a fixed-size vector. We can then use a similarity measure such as cosine similarity to find the relevant data.

In this tutorial, we use TF-IDF embeddings from the scikit-learn library and FAISS, a vector search library from Facebook Research (installed here as the CPU-only build, faiss-cpu), for storing the vectors.

Note: One can also use any open-source embedding model (e.g. from Hugging Face) or an embedding API (Gemini, OpenAI, etc.); these are dense embedding models. We have used sparse TF-IDF embeddings instead.
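
To make the idea of TF-IDF vectors and cosine similarity concrete, here is a minimal pure-Python sketch. It is a simplification and not scikit-learn's implementation: it tokenizes on whitespace only, and the three sample sentences are made up for illustration rather than taken from the budget speech. The smoothed IDF term follows the formula scikit-learn documents as its default.

```python
import math
from collections import Counter

def tfidf_vectors(texts):
    """Return (vocabulary, unit-length TF-IDF vectors) for a list of texts."""
    tokenized = [text.lower().split() for text in texts]
    vocab = sorted({word for tokens in tokenized for word in tokens})
    n_docs = len(texts)
    # Document frequency: the number of texts in which each word appears
    df = Counter(word for tokens in tokenized for word in set(tokens))
    # Smoothed inverse document frequency: rarer words get larger weights
    idf = {word: math.log((1 + n_docs) / (1 + df[word])) + 1 for word in vocab}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vec = [counts[word] * idf[word] for word in vocab]
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        vectors.append([v / norm for v in vec])  # scale to unit length
    return vocab, vectors

def cosine(a, b):
    """For unit-length vectors, cosine similarity is just the dot product."""
    return sum(x * y for x, y in zip(a, b))

texts = [
    "tax relief for salaried employees",
    "new tax regime for employees",
    "investment in rural roads",
]
vocab, vecs = tfidf_vectors(texts)
print(len(vocab), "dimensions per vector")
print("text 0 vs text 1:", round(cosine(vecs[0], vecs[1]), 3))
print("text 0 vs text 2:", round(cosine(vecs[0], vecs[2]), 3))
```

Texts that share words get a positive score, while texts with no word in common score zero. This is also the main limitation of sparse embeddings: synonyms that use different words are not matched.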

import os
import faiss
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer

class FAISS(object):
    def __init__(self, max_features=768):
        self.vectorizer = TfidfVectorizer(max_features=max_features)
        self.index = None
        self.texts = []

    # Create embeddings and add them to a flat L2 index
    def fit(self, texts):
        self.texts = texts
        embeddings = self.vectorizer.fit_transform(texts).toarray().astype('float32')
        dimension = embeddings.shape[1]
        self.index = faiss.IndexFlatL2(dimension)
        self.index.add(embeddings)
        return embeddings

    # Similarity search: embed the query and
    # fetch the top k most similar chunks
    def search(self, query, k=10):
        query_embedding = self.vectorizer.transform([query]).toarray().astype('float32')
        distances, indices = self.index.search(query_embedding, k)
        # FAISS pads the result with -1 when fewer than k vectors are stored
        similar_texts = [self.texts[idx] for idx in indices[0] if idx != -1]
        return similar_texts

    def save(self, dir_path, index_name='faiss'):
        os.makedirs(dir_path, exist_ok=True)

        with open(os.path.join(dir_path, f'{index_name}.pkl'), 'wb') as f:
            pickle.dump({'vectorizer': self.vectorizer, 'texts': self.texts}, f)
        faiss.write_index(self.index, os.path.join(dir_path, f'{index_name}.index'))

    def load(self, dir_path, index_name='faiss'):
        with open(os.path.join(dir_path, f'{index_name}.pkl'), 'rb') as f:
            data = pickle.load(f)
            self.vectorizer = data['vectorizer']
            self.texts = data['texts']
        self.index = faiss.read_index(os.path.join(dir_path, f'{index_name}.index'))

# Usage
faiss_index = FAISS()
faiss_index.fit(chunks)

# Save the embeddings for future use
faiss_index.save('path/to/your/embeddings/directory', index_name='budget_embeddings')

# Load the Budget Index
budget_index = FAISS()
budget_index.load('path/to/your/embeddings/directory', index_name='budget_embeddings')

4.4.   Similarity Search

We search through the embeddings using FAISS's built-in similarity search function.

# Query the Budget Index for similar documents
query = "Who represented the Budget 2024?"

no_of_docs_to_fetch = 10    # Get top 10 similar documents
context_chunks = budget_index.search(query, k=no_of_docs_to_fetch)
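
One detail worth checking: the index used in this tutorial is a `faiss.IndexFlatL2`, which ranks by squared L2 distance rather than cosine similarity. scikit-learn's `TfidfVectorizer` L2-normalizes each vector by default, and for unit-length vectors the squared L2 distance equals 2 minus twice the cosine similarity, so both measures produce the same ranking. The sketch below verifies this on made-up three-dimensional vectors; the numbers and helper names are illustrative and are not taken from the budget index.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = normalize([1.0, 2.0, 0.0])
docs = [normalize(v) for v in ([1.0, 2.0, 0.5], [0.0, 1.0, 3.0], [2.0, 0.0, 1.0])]

for d in docs:
    # For unit vectors: squared L2 distance == 2 - 2 * cosine similarity
    print(round(squared_l2(q, d), 6), round(2 - 2 * dot(q, d), 6))

# Smallest distance first vs. largest cosine first gives the same order
by_l2 = sorted(range(len(docs)), key=lambda i: squared_l2(q, docs[i]))
by_cosine = sorted(range(len(docs)), key=lambda i: -dot(q, docs[i]))
print(by_l2, by_cosine)
```

The equivalence breaks if the vectors are not normalized, for example when the vectorizer is created with `norm=None` or when a query contains no word from the vocabulary and maps to the zero vector.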

4.5.   Prompt Engineering

Before we proceed to the chatting part, we need to give the model an instruction on how the bot should behave. This is very important because, in our experience, the quality of the response depends heavily on the framing of this prompt.

Essentially, our prompt will have 3 parts: Instruction, Context and the Query.

prompt = """You are an expert in the analysis of Government Budgets.
The following contexts will be from India's Budget 2024.
The user will ask questions about the Budget 2024 and you have to provide the answers based on the context.
Answer in a professional tone and use bullets and paragraphs to make the answer more readable.
If you can't find the answer in the context, reply with "I am sorry, I couldn't find the answer to your question.".
Don't put any external information in the answer.

Context: {context}

Question: {question}

Answer: Your answer goes here.
"""

# Construct the prompt, joining the retrieved chunks into one block of text
prompt = prompt.format(context="\n\n".join(context_chunks), question=query)
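
With 250-word chunks and 10 retrieved chunks, the context alone is roughly 2,500 words, and a word typically maps to more than one token, so the prompt can approach the model's context window. One possible safeguard is to cap the context by a word budget before formatting the prompt. The helper below is a hypothetical sketch: the name `fit_context` and the `max_words` parameter are assumptions for illustration, and word count is only a rough stand-in for the real token count.

```python
def fit_context(chunks, max_words):
    """Keep the highest-ranked chunks whose combined length stays within max_words.

    Hypothetical helper: `chunks` is assumed to be ordered best match first.
    """
    kept, used = [], 0
    for chunk in chunks:
        n_words = len(chunk.split())
        if used + n_words > max_words:
            break  # stop at the first chunk that would exceed the budget
        kept.append(chunk)
        used += n_words
    return "\n\n".join(kept)

sample_chunks = ["alpha beta gamma", "delta epsilon", "zeta eta theta iota"]
print(fit_context(sample_chunks, max_words=5))
```

The returned string could be passed as the `context` argument when formatting the prompt, in place of the full list of chunks.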

4.6.   Run Llamafile on CPU

We have to start the Llamafile as a server from the terminal.

# Make the downloaded Llamafile executable
$ chmod +x tiny.llamafile
# Launch Llamafile as a server
$ ./tiny.llamafile  --nobrowser --server --port 8080 -c 4096
# This will load the model and start the server in the terminal 
$ llama server listening at
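
Loading the model can take a while on an older CPU, so a script that talks to the server may want to wait until the port accepts connections. The following is a minimal sketch using only the standard library. The helper name `wait_for_port` and its default values are assumptions for illustration, and an open port only shows that the server process is listening, not that the model has finished loading.

```python
import socket
import time

def wait_for_port(host="localhost", port=8080, timeout=30.0, interval=0.5):
    """Return True once a TCP connection to host:port succeeds, else False after timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: wait a bit and retry
            time.sleep(interval)
    return False

# Example usage (uncomment once the Llamafile server has been started):
# if not wait_for_port("localhost", 8080):
#     raise SystemExit("Llamafile server is not reachable on port 8080")
```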

4.7.   Chat with the Budget Bot

Now we can chat with our custom budget bot using the prompt we have created.

import requests

class ChatLlamaFile(object):
    def __init__(self, host="localhost", port=8080):
        self.host = host
        self.port = port
        self.base_url = f"http://{host}:{port}"
        self.chat_url = f"{self.base_url}/v1/chat/completions"
        self.headers = {
            'Content-Type': 'application/json',
            'Authorization': 'Bearer no-key'
        }

    def chat(self, prompt, query):
        # The instruction and context go in the system message, the question in the user message
        data = {
            "messages": [
                {
                    "role": "system",
                    "content": prompt
                },
                {
                    "role": "user",
                    "content": query
                }
            ]
        }
        response = requests.post(self.chat_url, headers=self.headers, json=data)
        return response.json()

# Let's Chat
response = ChatLlamaFile().chat(prompt, query)
answer = response.get('choices')[0].get('message').get('content')

print("Query:", query)
print("Answer:", answer)

Example response:

Query: Who represented the Budget 2024?  
Answer: The Budget 2024 was presented by Finance Minister Nirmala Sitharaman.
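
The chained `.get()` calls used to read the answer raise an exception if the server returns something other than a normal completion. A slightly more defensive way to read the reply is sketched below. The helper name `extract_answer` is hypothetical, and the shape of the `bad` payload is an assumed example of an error response, not something taken from the Llamafile documentation.

```python
def extract_answer(response):
    """Pull the assistant text out of an OpenAI-style chat completion dict.

    Returns None when the expected fields are missing, e.g. on an error payload.
    """
    choices = response.get("choices") or []
    if not choices:
        return None
    message = choices[0].get("message") or {}
    return message.get("content")

ok = {"choices": [{"message": {"role": "assistant", "content": "Hello"}}]}
bad = {"error": {"message": "something went wrong"}}  # assumed error shape
print(extract_answer(ok))
print(extract_answer(bad))
```

A caller can then fall back to a friendly message when the function returns `None` instead of crashing mid-conversation.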

5. Summary and Conclusion
Congratulations! You have successfully built a custom chatbot to chat with India's Budget 2024 using your local CPU machine. You have learned how to load a document, make chunks, create embeddings, and then save them in a vector store. This allows subsequent operations like searching for similar documents, constructing a prompt and chatting with the budget bot.
5.1.   References