Create Your Own Local AI Chatbot with Ollama and LangChain

Pratham Jaiswal - Sep 18 - Dev Community

Have you ever used ChatGPT, Gemini, Claude, or any other generative AI application and wondered how you could build something similar on your own? If you’ve ever wanted to create a chatbot that can answer questions based on specific documents, data, or context, you’re in the right place. In this guide, we’ll walk through the process of developing a local AI chatbot using Ollama and LangChain. Whether you’re interested in personal projects or professional applications, this tutorial will equip you with the knowledge to get started.

What are LLMs?

LLMs, or Large Language Models, are a type of artificial intelligence model designed to process and generate human-like text based on the input it receives. These models are trained on vast amounts of text data and use machine learning techniques to understand language patterns, context, and semantics. They are the foundation of many modern AI applications, such as chatbots, virtual assistants, and automated content generation tools. LLMs can perform a variety of tasks, including text generation, translation, summarization, and answering questions, making them highly versatile in natural language processing (NLP).

What is LangChain?

LangChain is a framework designed to simplify the development of applications that involve LLMs. It provides tools to connect LLMs with external data sources, enabling them to interact with documents, databases, APIs, and more. LangChain helps developers create applications that combine document retrieval, language understanding, and reasoning, enabling the creation of advanced AI applications like chatbots, personalized assistants, or knowledge-based systems. It also supports functionalities like chaining different models together, integrating memory to maintain context across interactions, and handling complex workflows efficiently.

What is Ollama?

Ollama is a tool that allows users to run large language models (LLMs) locally on their machines without needing extensive cloud infrastructure. It provides a lightweight interface for downloading, serving, and interacting with different AI models directly on a local server. Ollama enables developers to build AI applications, such as chatbots or generative AI tools, while maintaining control over data privacy and performance by keeping the computations local. It’s particularly useful for users who want to use LLMs without relying on internet-based services or cloud infrastructure.

Understanding the Process: Retrieval-Augmented Generation (RAG)

Before getting into the technical details, it’s important to understand the core process behind our AI chatbot. We’ll be using Retrieval-Augmented Generation (RAG), a powerful method that combines document retrieval with generative AI. Here’s a breakdown of how it works:

  1. Document Loading: Extract text from documents and prepare it for processing.

  2. Chunking: Large documents are divided into smaller, manageable chunks. This makes it easier for the system to handle and retrieve relevant pieces of information.

  3. Embedding and Storage: Each text chunk is transformed into embeddings — numerical representations that capture the semantic meaning of the text. These embeddings are then stored in a vector database, allowing for fast and efficient retrieval.

  4. Model Initialization: Set up the LLM, which will generate responses based on the retrieved document content. We also define the prompt template to guide how the model should respond to user queries.

  5. Interactive Chain: Implement an interactive loop to handle user queries. The chatbot utilizes semantic search or similarity-based methods provided by LangChain to retrieve relevant information from the vector database and generates answers based on the context of the retrieved documents.

RAG process diagram

Getting Started

Note: You should have at least 8 GB of RAM for decent performance.

Install the necessary libraries.

pip install langchain langchain-text-splitters langchain-chroma langchain-community ollama sentence-transformers pypdf

Create a new Jupyter notebook (.ipynb) to manage the code more effectively and avoid re-running the entire process when making changes.

(Optional) Suppress warnings to keep the output clean.

def warn(*args, **kwargs):
    pass

import warnings
warnings.warn = warn
warnings.filterwarnings('ignore')

Set Up Ollama

Get the download and installation instructions for Ollama from here.

Once installed, start Ollama by running the following command in your command line:

ollama serve

This will start a local server, generally on port 11434. You can check that it is running by visiting http://localhost:11434/.
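
If you prefer checking from Python, a quick request to the root endpoint serves as a sanity check (a minimal sketch; it assumes the requests package is installed and Ollama is serving on the default port):

import requests

# Ollama's root endpoint replies with a short status message when the server is up.
response = requests.get("http://localhost:11434/")
print(response.status_code, response.text)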

Pull Required Models

Browse the available models here. We will be using phi3.5 because it is comparatively lightweight.

Note: You should have at least 8 GB of RAM available to run the 7B models, 16 GB for the 13B models, and 32 GB for the 33B models (source).

Now pull the model (here, phi3.5):

ollama pull phi3.5
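
Optionally, you can confirm the model is available from Python using the ollama package installed earlier (a minimal sketch; the exact shape of the response may vary between package versions):

import ollama

# The pulled model (e.g. phi3.5) should appear in this listing.
print(ollama.list())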

Import Libraries

Import all required libraries.

import os
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_chroma import Chroma
from langchain_community.llms import Ollama
from langchain.chains import RetrievalQA
from langchain_core.prompts import PromptTemplate
from langchain_community.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain.memory import ConversationBufferMemory

Load and Process Documents

Store all the files you intend to use in the ./context/ directory. (You can, of course, rename the directory, but make sure to update the code accordingly.)

Define the directory containing your PDF files (or text files or any other format) and load them using LangChain’s PyPDFLoader or any other document loader, depending on the type of document you’re using.

directory = './context/'
all_documents = []

for filename in os.listdir(directory):
    if filename.endswith('.pdf'):
        filepath = os.path.join(directory, filename)
        loader = PyPDFLoader(filepath)
        documents = loader.load()
        all_documents.extend(documents)
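
If your context directory also contains plain-text files, a similar loop with LangChain's TextLoader covers them (a minimal sketch, assuming UTF-8 encoded .txt files):

from langchain_community.document_loaders import TextLoader

for filename in os.listdir(directory):
    if filename.endswith('.txt'):
        filepath = os.path.join(directory, filename)
        # TextLoader reads the whole file in as a single document.
        loader = TextLoader(filepath, encoding='utf-8')
        all_documents.extend(loader.load())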

Chunking

To handle large documents, split them into smaller chunks using the RecursiveCharacterTextSplitter.

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
texts = text_splitter.split_documents(all_documents)

The smaller the chunk_size, the more chunks are produced, which means embedding and indexing will take more time and resources.
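
To see the effect before settling on a value, you can compare how many chunks a few settings produce (illustrative only; the sizes below are arbitrary):

# Compare how many chunks different chunk sizes produce for the loaded documents.
for size in (500, 1000, 2000):
    splitter = RecursiveCharacterTextSplitter(chunk_size=size, chunk_overlap=100)
    print(f"chunk_size={size} -> {len(splitter.split_documents(all_documents))} chunks")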

Create Embeddings and Store

Load the embedding function (SentenceTransformerEmbeddings) and model (all-MiniLM-L6-v2). There are more models available in sentence-transformers. Alternatively, you can use embedding models provided by Ollama (see the sketch after the note below).

Note:

  • If you intend to use an embedding model from Ollama, make sure you pull that model too.
  • Larger embedding models consume more time and resources.

embeddings = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")
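
As an alternative, here is a minimal sketch of swapping in an Ollama-served embedding model; it assumes you have pulled an embedding model such as nomic-embed-text (ollama pull nomic-embed-text) beforehand:

from langchain_community.embeddings import OllamaEmbeddings

# Uses the local Ollama server for embeddings instead of sentence-transformers.
embeddings = OllamaEmbeddings(model="nomic-embed-text")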

Create embeddings for the document chunks and store them using Chroma or any other vector store. We'll write the code so that a new vector database is created only if an existing one isn't found.

if os.path.exists('chroma_db'):
    # An existing vector database was found -> load it
    vectorstore = Chroma(embedding_function=embeddings, persist_directory="chroma_db")
else:
    # No existing vector database -> create a new one from the document chunks
    vectorstore = Chroma.from_documents(texts, embeddings, persist_directory="chroma_db")
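
Optionally, run a quick similarity search to confirm the store returns sensible chunks (a small sanity check; the query string is just an example):

# Retrieve the two chunks most similar to a sample query and preview them.
docs = vectorstore.similarity_search("What is this document about?", k=2)
for doc in docs:
    print(doc.page_content[:200], "\n---")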

Set Up the LLM

Initialize the pulled model. As mentioned above, you can use a different model here (just make sure you’ve pulled it from Ollama).

llm = Ollama(model="phi3.5")
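
A one-line smoke test confirms the model responds before wiring up the full chain (this assumes ollama serve is still running):

# Direct call to the model, with no retrieval or prompt template involved.
print(llm.invoke("Reply with one short sentence to confirm you are working."))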

Create a Prompt

Create a prompt template to ensure the model stays relevant to the context and answers in a structured manner.

prompt_template = """
You are an expert. Your role is to provide clear, concise, and accurate advice only based on the information from the provided documents and previous conversations with the user. If you don't know the answer, just say that you don't know, definitely do not try to make up an answer.

Previous conversations:
{history}

Document context:
{context}

Question: {question}
"""

prompt = PromptTemplate(
    template=prompt_template, input_variables=["history", "context", "question"]
)
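
To see exactly what the model will receive, you can render the template with placeholder values (purely illustrative):

# Fill the template with dummy values to inspect the final prompt text.
print(prompt.format(
    history="(no previous conversation)",
    context="(retrieved document chunks appear here)",
    question="What is this document about?",
))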

Create a Chaining function

Use RetrievalQA so the model answers queries related to the provided documents or data, avoids irrelevant questions, and keeps the chat history in memory. You can also use ConversationalRetrievalChain or other chains, but not all of them support ignoring irrelevant questions or maintaining memory.

def chainingFunction():
    chain = RetrievalQA.from_chain_type(llm=llm,
                                    chain_type="stuff",
                                    retriever=vectorstore.as_retriever(),
                                    chain_type_kwargs={
                                        "prompt": prompt,
                                        "memory": ConversationBufferMemory(
                                            memory_key="history",
                                            input_key="question"),
                                    }, 
                                    return_source_documents=False) # Chaining method

Based on your requirements, you can set chain_type to "stuff", "map_reduce", "refine", or "map_rerank" (a sketch of swapping the chain type follows the list below).

  • stuff: Uses the full document content for generating responses, suitable for scenarios where the entire document context is needed.

  • map_reduce: Splits the document into chunks, processes each chunk separately, and then combines the results to generate a final response, ideal for handling large documents.

  • refine: Iteratively refines answers by using the initial response to guide further queries and obtain more precise results.

  • map_rerank: Ranks retrieved chunks based on their relevance to the query and uses the top-ranked chunks to generate a response, optimizing for accuracy in information retrieval.
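
As a rough sketch, switching the chain type only requires changing one argument. The custom prompt is omitted here because map_reduce expects different prompt kwargs than the stuff chain; this variant is an illustration, not part of the original setup:

# Hypothetical map_reduce variant: each retrieved chunk is processed separately
# and the partial answers are then combined into a final response.
chain_mr = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="map_reduce",
    retriever=vectorstore.as_retriever(),
    return_source_documents=False,
)
print(chain_mr.invoke("Summarize the key points of the documents.")["result"])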

Breaking down the rest of the parameters:

  • retriever=vectorstore.as_retriever(): Provides the method for retrieving relevant documents. vectorstore is an instance of a vector store (Chroma in this case) that retrieves documents based on similarity search. A sketch of tuning the retriever follows this list.

  • chain_type_kwargs: A dictionary containing additional arguments specific to the chain_type:

    • "prompt": prompt: Sets the prompt template used to guide the language model's responses. prompt defines how the model should format its answers based on the document context and previous conversations.

    • "memory": ConversationBufferMemory(memory_key="history", input_key="question"): Configures memory for the conversation. ConversationBufferMemory stores the history of the conversation (memory_key="history") and links it to the current question (input_key="question"), allowing the model to maintain context across interactions.

  • return_source_documents=False: Determines whether to return the source documents along with the response. Setting it to False means only the generated response will be returned, not the documents used for retrieval.
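
By default the retriever returns a handful of the most similar chunks per query (typically four for Chroma). You can tune this with search_kwargs; the value 3 below is arbitrary:

# Retrieve only the top 3 most similar chunks for each question.
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})

Pass it as retriever=retriever when building the chain if you want to use this setting.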

To keep the chat running, invoke the chain inside an infinite while loop, with an exit condition.

def chainingFunction():
    chain = RetrievalQA.from_chain_type(llm=llm,
                                    chain_type="stuff",
                                    retriever=vectorstore.as_retriever(),
                                    chain_type_kwargs={
                                        "prompt": PROMPT,
                                        "memory": ConversationBufferMemory(
                                            memory_key="history",
                                            input_key="question"),
                                    }, 
                                    return_source_documents=False) # Chaining method

    while True:
        query = input("Question: ")

        print("User:", query, "\n")
        if query.lower() in ["quit","exit","bye"]:
            print("Bot: Goodbye!")
            break

        result = chain.invoke(query)

        print("Bot:", result["result"], "\n\n")

This while loop will run indefinitely until the user says bye, exit, or quit.

Finally, call this function.

chainingFunction()

Full code:

def warn(*args, **kwargs):
    pass

import warnings
warnings.warn = warn
warnings.filterwarnings('ignore')

import os
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_chroma import Chroma
from langchain_community.llms import Ollama
from langchain.chains import RetrievalQA
from langchain_core.prompts import PromptTemplate
from langchain_community.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain.memory import ConversationBufferMemory

directory = './context/'
all_documents = []

for filename in os.listdir(directory):
    if filename.endswith('.pdf'):
        filepath = os.path.join(directory, filename)
        loader = PyPDFLoader(filepath)
        documents = loader.load()
        all_documents.extend(documents)

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
texts = text_splitter.split_documents(all_documents)

embeddings = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

if os.path.exists('chroma_db'):
    # An existing vector database was found -> load it
    vectorstore = Chroma(embedding_function=embeddings, persist_directory="chroma_db")
else:
    # No existing vector database -> create a new one from the document chunks
    vectorstore = Chroma.from_documents(texts, embeddings, persist_directory="chroma_db")

llm = Ollama(model="phi3.5")

prompt_template = """
You are an expert. Your role is to provide clear, concise, and accurate advice only based on the information from the provided documents and previous conversations with the user. If you don't know the answer, just say that you don't know, definitely do not try to make up an answer.

Previous conversations:
{history}

Document context:
{context}

Question: {question}
"""

prompt = PromptTemplate(
    template=prompt_template, input_variables=["history", "context", "question"]
)

def chainingFunction():
    chain = RetrievalQA.from_chain_type(llm=llm,
                                    chain_type="stuff",
                                    retriever=vectorstore.as_retriever(),
                                    chain_type_kwargs={
                                        "prompt": prompt,
                                        "memory": ConversationBufferMemory(
                                            memory_key="history",
                                            input_key="question"),
                                    }, 
                                    return_source_documents=False) # Chaining method

    while True:
        query = input("Question: ")

        print("User:", query, "\n")
        if query.lower() in ["quit","exit","bye"]:
            print("Bot: Goodbye!")
            break

        result = chain.invoke(query)

        print("Bot:", result["result"], "\n\n")

chainingFunction()

Note: Make sure to do this in a notebook (.ipynb) so you don't have to re-run the text processing every time.

OR

Alternatively, check for an existing vector store first and load it; only if none is found, perform all the text processing and create the embedding vector database, as shown below.

def warn(*args, **kwargs):
    pass

import warnings
warnings.warn = warn
warnings.filterwarnings('ignore')

import os
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_chroma import Chroma
from langchain_community.llms import Ollama
from langchain.chains import RetrievalQA
from langchain_core.prompts import PromptTemplate
from langchain_community.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain.memory import ConversationBufferMemory

embeddings = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

if os.path.exists('chroma_db'):
    # An existing vector database was found -> load it
    vectorstore = Chroma(embedding_function=embeddings, persist_directory="chroma_db")
else:
    # No existing vector database -> process the documents and create a new one
    directory = './context/'
    all_documents = []

    for filename in os.listdir(directory):
        if filename.endswith('.pdf'):
            filepath = os.path.join(directory, filename)
            loader = PyPDFLoader(filepath)
            documents = loader.load()
            all_documents.extend(documents)

    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
    texts = text_splitter.split_documents(all_documents)

    vectorstore = Chroma.from_documents(texts, embeddings, persist_directory="chroma_db")

llm = Ollama(model="phi3.5")

prompt_template = """
You are an expert. Your role is to provide clear, concise, and accurate advice only based on the information from the provided documents and previous conversations with the user. If you don't know the answer, just say that you don't know, definitely do not try to make up an answer.

Previous conversations:
{history}

Document context:
{context}

Question: {question}
"""

prompt = PromptTemplate(
    template=prompt_template, input_variables=["history", "context", "question"]
)

def chainingFunction():
    chain = RetrievalQA.from_chain_type(llm=llm,
                                    chain_type="stuff",
                                    retriever=vectorstore.as_retriever(),
                                    chain_type_kwargs={
                                        "prompt": prompt,
                                        "memory": ConversationBufferMemory(
                                            memory_key="history",
                                            input_key="question"),
                                    }, 
                                    return_source_documents=False) # Chaining method

    while True:
        query = input("Question: ")

        print("User:", query, "\n")
        if query.lower() in ["quit","exit","bye"]:
            print("Bot: Goodbye!")
            break

        result = chain.invoke(query)

        print("Bot:", result["result"], "\n\n")

chainingFunction()

Note: How long this takes to run will depend on your device's specifications.

By following these steps, you have created a local AI chatbot capable of answering questions or engaging in conversations based on the provided data or context. This framework can be adapted for various use cases, including legal assistance, an AI psychiatrist, customer support, or knowledge retrieval, providing a solid foundation for your needs.

Check out this legal assistant I developed using some Indian legal texts.

Thanks for reading! Be sure to check out my personal portfolio website and GitHub.

Happy coding!
