Background
Sometimes I just wonder: what if my PDFs could chat with me? It would make things much easier: I would enjoy studying more and have fewer doubts, as talking to them would make information clearer and simpler to find.
This is indeed possible, but generally things are a little bit more complicated than just opening Telegram and chatting with them... Or not? In today's post, we will explore how to build your own PDF chat, leveraging the Pinecone and AI21 APIs together with Telegram bots.
Setup - Folder structure
Your folder will have to look like this:
.
|__utils.py
|__bot.py
|__file.pdf
|__sources.pdf
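In utils.py we will define all the useful functions, and in bot.py we will build the bot. This split is not mandatory, it just keeps things tidy. Here file.pdf stands for the PDF you want to chat with, while sources.pdf is the file the bot will write the source pages to.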
Setup - Pinecone
Pinecone is a vector database hosting platform that lets you scale your vectorized storage and power AI/ML models (and much more, but this is what mainly interests us).
In case you do not have a Pinecone account, you just need to create one: it is really simple, and GitHub accounts are supported as a straightforward sign-up method.
Once you have created it, you can enjoy Pinecone's free plan and explore things a little bit. You will need to:
- First, click on "Indexes" in the navigation bar on the side of the page, and then proceed to create your first index. Give it a sensible name that reflects what you intend to store in it. For this tutorial, we will name it open-educational-resources, as we will be building a bot based on the book "Making Open Educational Resources with and for PreK12: A Collaboration Toolkit for Higher Education" (available for download in PDF format).
- Second, click on "API Keys" in the navigation bar and copy the default one, if you do not want to create a custom key.
- Last, go back to the index you created and copy the HOST link.
Now head over to utils.py and write:
import os

os.environ["PINECONE_API_KEY"] = "YOUR_API_KEY"
os.environ["PINECONE_ENVIRONMENT"] = (
    "https://open-educational-resources-some.random.letters.pinecone.io"  # HOST link
)
os.environ["USE_SERVERLESS"] = "False"
api_key = os.getenv("PINECONE_API_KEY") or "YOUR_API_KEY"
environment = (
    os.getenv("PINECONE_ENVIRONMENT")
    or "https://open-educational-resources-some.random.letters.pinecone.io"
)
Be careful with the names of the environment variables used here: they have to be spelled exactly as shown!
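Hard-coding keys in utils.py is fine for a quick experiment, but you may prefer to export them in your shell and only read them from Python. Here is a minimal sketch of that idea, assuming the variable is set before the script starts; `require_env` is a hypothetical helper of mine, not something Pinecone or LangChain provides.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message.

    Hypothetical helper: it only wraps os.environ, nothing library-specific.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


if __name__ == "__main__":
    # Demo value so the sketch runs on its own; normally you would export
    # PINECONE_API_KEY in your shell instead of setting it here.
    os.environ.setdefault("PINECONE_API_KEY", "demo-key")
    api_key = require_env("PINECONE_API_KEY")
    print("Key loaded, length:", len(api_key))
```

Failing early with the name of the missing variable is usually easier to debug than an authentication error coming back from the API later on.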
Setup - AI21
AI21 is an AI company that provides a range of LLM services: one of them is its hosted API, which comes with a free trial. You can sign up with GitHub as well.
Once you're signed up, you just need to generate and copy your first API key, and the magic is done!
Now head over to utils.py again and type:
os.environ["AI21_API_KEY"] = "YOUR_API_KEY"
ai21_api_key = os.getenv("AI21_API_KEY")
Setup - Local environment
Now that you have everything you need in terms of hosted environments, let's have a quick look at your local one.
You need to run the following command to install all the necessary packages:
python3 -m pip install pinecone-client==3.0.0 langchain==0.1.1 langchain-community==0.0.13 huggingface-hub==0.22.2 transformers==4.30.2 sentence-transformers==2.2.2 pikepdf==8.11.2 pypdf==3.17.4
I suggest creating a virtual environment, a conda environment or, even better, a devcontainer (if you are a VS Code user), in order to avoid conflicts with already installed versions of these packages.
AI responder architecture - Loading, embedding and storing your PDF
To load your PDF, you will leverage LangChain functions, as follows:
# Import necessary modules
from langchain.text_splitter import CharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain.storage import LocalFileStore
from langchain.embeddings import CacheBackedEmbeddings
# Load and preprocess the PDF document
loader = PyPDFLoader("./Making_Open_Educational_Resources_PreK12.pdf")
documents = loader.load()
# Split the documents into smaller chunks for processing
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)
# Create embeddings and store them persistently
embeddings = HuggingFaceEmbeddings()
store = LocalFileStore("./cache/")
cached_embeddings = CacheBackedEmbeddings.from_bytes_store(
    underlying_embeddings=embeddings,
    document_embedding_cache=store,
    namespace="EducationalResources",
)
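To get a feeling for what chunk_size and chunk_overlap do in the splitting step above, here is a simplified sketch in plain Python. It is not how CharacterTextSplitter is implemented (the LangChain splitter first splits on a separator and then merges the pieces); `split_with_overlap` is a hypothetical function that just slides a fixed-size window over the text.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int = 0) -> list[str]:
    """Cut `text` into windows of at most `chunk_size` characters.

    Consecutive windows share `chunk_overlap` characters. Simplified
    stand-in for a real text splitter, for illustration only.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    # Each new window starts `step` characters after the previous one
    step = chunk_size - chunk_overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


if __name__ == "__main__":
    print(split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=0))
    print(split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2))
```

With no overlap the ten-character string is cut into "abcd", "efgh" and "ij". With an overlap of 2, each chunk repeats the last two characters of the previous one, which can help when an answer sits right on a chunk boundary, at the cost of storing more vectors.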
AI responder architecture - VectorDB
It is time now to create a Pinecone VectorDB, and you can do it as follows:
# Import necessary modules
import pinecone
from langchain_community.vectorstores import Pinecone
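# Set up the specs for Pinecone client
pc = pinecone.Pinecone(api_key=api_key)  # reuse the key defined at the top of utils.py
# The spec is only needed if you create the index from code instead of the web console
spec = pinecone.ServerlessSpec(cloud="aws", region="us-west-2")
index_name = "open-educational-resources"
index = pc.Index(index_name)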
# Create the vectorDB
vectordb = Pinecone.from_documents(texts, cached_embeddings, index_name=index_name)
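If you would rather create the index from code than from the web console, something along these lines may work. This is a sketch under assumptions: `ensure_index` is a hypothetical helper, and `pc` is expected to behave like the pinecone.Pinecone client from pinecone-client 3.x (its list_indexes().names() and create_index(...) calls).

```python
def ensure_index(pc, name: str, dimension: int, spec, metric: str = "cosine") -> bool:
    """Create the index `name` if the client `pc` does not list it yet.

    Returns True when the index was created, False when it already existed.
    Hypothetical helper; `pc` is assumed to be a pinecone.Pinecone client.
    """
    if name in pc.list_indexes().names():
        return False
    pc.create_index(name=name, dimension=dimension, metric=metric, spec=spec)
    return True
```

In utils.py you could call it as ensure_index(pc, index_name, dimension=768, spec=spec) before building the vectorDB. The value 768 is my assumption for the default HuggingFaceEmbeddings model: it is worth checking with len(embeddings.embed_query("test")), since the index dimension has to match the embedding size.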
AI responder architecture - LLM setup and responder function
Now you can set up your AI21 LLM model like this:
from langchain_community.llms import AI21
llm = AI21(ai21_api_key=ai21_api_key)
And generate a responder architecture in this way:
from langchain.chains import ConversationalRetrievalChain
qa_chain = ConversationalRetrievalChain.from_llm(
    llm, vectordb.as_retriever(search_kwargs={"k": 2}), return_source_documents=True
)
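The retriever above is asked for the k=2 chunks closest to the question. Conceptually, this boils down to ranking stored vectors by similarity to the query vector. The sketch below shows the idea in plain Python, assuming cosine similarity as the metric; the real search happens inside Pinecone and is far more efficient, and `cosine_similarity` and `top_k` are hypothetical names used only for illustration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: list[float], chunks: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the names of the k chunks most similar to the query vector."""
    ranked = sorted(
        chunks,
        key=lambda name: cosine_similarity(query, chunks[name]),
        reverse=True,
    )
    return ranked[:k]


if __name__ == "__main__":
    # Toy two-dimensional "embeddings"; real ones have hundreds of dimensions
    toy_chunks = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
    print(top_k([1.0, 0.1], toy_chunks, k=2))
```

A small k keeps the prompt sent to the LLM short, while a larger k gives it more context to work with: k=2 is simply the value used in this tutorial.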
So, everything is set up now: we just need to create a function that takes the user's query as input and returns the LLM's answer based on the vectorized PDF. It also saves the pages of the PDF from which the information was taken, so that we can send them to the user along with the response.
from pikepdf import Pdf


def AI_responder(query, with_source=True):
    # Each call starts from an empty history, so questions are answered independently
    chat_history = []
    result = qa_chain({"question": query, "chat_history": chat_history})
    if with_source:  # save the relevant part of the PDF used to generate the answer
        # Collect the (zero-based) pages the retrieved chunks came from, without duplicates
        pgs = []
        for doc in result["source_documents"]:
            idx = int(doc.metadata["page"])
            if idx not in pgs:
                pgs.append(idx)
        filename = "./Making_Open_Educational_Resources_PreK12.pdf"
        filetosave = "./sources.pdf"  # the PDF file with the sources we will send to the user
        pdf = Pdf.open(filename)
        new_pdf = Pdf.new()
        n_pages = len(pdf.pages)
        # For every source page, also keep the previous and the next page (when they exist)
        pages_def = []
        for i in pgs:
            if 0 < i < n_pages - 1:
                pages_def.extend([i - 1, i, i + 1])
            elif i == 0:
                pages_def.extend([i, i + 1])
            else:
                pages_def.extend([i - 1, i])
        print(pgs)
        # Copy the selected pages, in document order, into the new PDF
        for i in range(n_pages):
            if i in pages_def:
                print(i)
                new_pdf.pages.append(pdf.pages[i])
        new_pdf.save(filetosave)
    return result["answer"]  # return the answer
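The page selection inside AI_responder (each source page plus its neighbours, clipped to the document) can also be written as a small pure function, which makes it easy to test without opening any PDF. This is just a sketch of the same idea; `pages_with_neighbours` is a hypothetical name and pages are assumed to be zero-based, as in the loader metadata used above.

```python
def pages_with_neighbours(pages: list[int], n_pages: int) -> list[int]:
    """Return the given zero-based pages plus their previous and next page.

    The result is sorted, free of duplicates and limited to 0..n_pages-1.
    """
    selected = set()
    for page in pages:
        for candidate in (page - 1, page, page + 1):
            if 0 <= candidate < n_pages:
                selected.add(candidate)
    return sorted(selected)


if __name__ == "__main__":
    # Sources found on the first page and on page index 5 of a 10-page PDF
    print(pages_with_neighbours([0, 5], n_pages=10))
```

For sources on pages 0 and 5 of a ten-page document, this selects pages 0, 1, 4, 5 and 6. Sending one page of context on each side is a design choice: it makes the sources file easier to read, but you could widen or remove the window.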
Bot architecture - Create the bot
To run the bot, you will first need to install the python-telegram-bot library (plus emoji, which bot.py uses to format its replies):
python3 -m pip install python-telegram-bot emoji
Now, open Telegram and type "BotFather" into the search bar: this will guide you to the father of all bots, which will create the bot user we are interested in.
Send BotFather the /newbot command, and it will prompt you to give your bot a name and then a username that ends with "bot". For today's tutorial, we will create a simple Open Educational Resources assistant, so we will simply go for "OpenSourceEduBot" as both the name and the username.
Once you are done with the naming, BotFather will send you what really interests us: the API token, through which we will interact with Telegram, sending messages and retrieving responses. Copy the token, open bot.py and paste it there:
from telegram.ext import Application, CallbackContext, CommandHandler, MessageHandler, filters

# not a real token
TOKEN = "84890243:42u9iodfbgdjsbgdgiH"
Bot architecture - Define functions
Now we just need to define some functions that handle the conversation with our bot, and then we will be ready to deploy it!
We will first create a start_command function to greet the user when they enter the chat:
async def start_command(update, context):
    await update.message.reply_text("Hi! I'm OpenSourceEduBot, your personal assistant committed to helping you build Open Educational Resources. I'm based on the book 'Making Open Educational Resources with and for PreK12: A Collaboration Toolkit for Higher Education'. Just send a message and I'll do everything in my power to help you!")
Then we define a couple of fundamental functions: one that replies to unrecognized commands and one that handles errors:
async def unrecognized_command(update, context):
    text = update.message.text
    if not text.startswith("/start"):
        await update.message.reply_text(f"I cannot understand the message:\n\"{text}\"\nas my programmer did not include it among the commands I am set to respond to. Please check for misspellings/errors, or contact the programmer if you feel anything is wrong/missing.")


async def error_handler(update, context: CallbackContext) -> None:
    # Errors can occur without an associated message, so check before replying
    if update is not None and getattr(update, "message", None):
        await update.message.reply_text("Sorry, something went wrong.")
And now we are ready for "the big one", our AI responder function: this will take all the text messages sent by the user and respond to them according to the information contained in the PDFs.
import emoji

from utils import AI_responder


async def AI_assistant_command(update, context):
    text = update.message.text
    await update.message.reply_text("Thanks for submitting a request! Your personal assistant will be processing it: this may take a while...")
    message = AI_responder(query=text)
    # send the AI assistant response
    await update.message.reply_text(emoji.emojize(f"Your personal assistant says:\n{message}\nHope this was useful! In the next message, you'll find the sources file"))
    # send the PDF where you stored the source information
    source_file = "./sources.pdf"
    with open(source_file, "rb") as f:
        await update.message.reply_document(document=f)
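As written, AI_responder starts from an empty chat history on every call, so the bot does not remember earlier questions. If you want follow-up questions to work, one option is to keep a short history per chat and pass it to the chain as a list of (question, answer) pairs. The sketch below shows only the bookkeeping part; `ChatMemory` is a hypothetical class, and wiring it into AI_responder is left as an assumption on my side.

```python
from collections import defaultdict, deque


class ChatMemory:
    """Keep the last `max_turns` (question, answer) pairs for each chat."""

    def __init__(self, max_turns: int = 5) -> None:
        # One bounded queue per chat: old turns are dropped automatically
        self._turns = defaultdict(lambda: deque(maxlen=max_turns))

    def history(self, chat_id: int) -> list[tuple[str, str]]:
        """Return the stored turns for a chat, oldest first."""
        return list(self._turns[chat_id])

    def add(self, chat_id: int, question: str, answer: str) -> None:
        """Record a completed question/answer exchange."""
        self._turns[chat_id].append((question, answer))


if __name__ == "__main__":
    memory = ChatMemory(max_turns=2)
    memory.add(42, "What is an OER?", "An openly licensed learning resource.")
    print(memory.history(42))
```

In the bot, the chat identifier could come from update.effective_chat.id, and the list returned by history() could replace the empty chat_history in AI_responder. Capping the number of turns keeps the prompt from growing without limit.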
Bot architecture - Build the application
Last but not least, we need to build our application, and to do so we will match each function we previously defined with a chat input, and then we will simply launch the bot.
if __name__ == '__main__':
    import sys

    try:
        print("Bot is up and running")
        # Build the application
        application = Application.builder().token(TOKEN).build()
        # Match predefined functions to chat input
        ## start command (command handler)
        application.add_handler(CommandHandler('start', start_command))
        ## AI assistant command (message handler): plain text only, so commands fall through
        application.add_handler(MessageHandler(filters.TEXT & ~filters.COMMAND, AI_assistant_command))
        ## Filter unwanted commands
        application.add_handler(MessageHandler(filters.COMMAND, unrecognized_command))
        # Handle errors
        application.add_error_handler(error_handler)
        # Run bot
        application.run_polling(poll_interval=1.0)
    except KeyboardInterrupt:
        sys.exit(0)
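A note on handler order: within a handler group, python-telegram-bot hands an update to the first handler whose filter matches, so both the registration order and the overlap between filters matter. The toy model below (plain Python, no Telegram involved, all names hypothetical) shows why a plain text filter registered before a command filter would swallow unrecognized commands, and how excluding commands from the text filter avoids that.

```python
from typing import Callable, Optional

# A handler is modelled as (filter function, handler name)
Handler = tuple[Callable[[str], bool], str]


def dispatch(text: str, handlers: list[Handler]) -> Optional[str]:
    """Return the name of the first handler whose filter accepts `text`."""
    for matches, name in handlers:
        if matches(text):
            return name
    return None


def is_text(text: str) -> bool:
    """Toy version of a text filter: any non-empty message."""
    return bool(text)


def is_command(text: str) -> bool:
    """Toy version of a command filter: messages starting with a slash."""
    return text.startswith("/")


def is_plain_text(text: str) -> bool:
    """Text that is not a command (the idea behind TEXT & ~COMMAND)."""
    return is_text(text) and not is_command(text)


if __name__ == "__main__":
    overlapping = [(is_text, "AI assistant"), (is_command, "unrecognized command")]
    disjoint = [(is_plain_text, "AI assistant"), (is_command, "unrecognized command")]
    print(dispatch("/foo", overlapping))
    print(dispatch("/foo", disjoint))
```

With overlapping filters, "/foo" ends up at the AI assistant and the unrecognized-command reply never fires. With disjoint filters, each message reaches the handler it was meant for.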
Now you just need to run the bot from your computer (or, alternatively, look into remote hosting services such as PythonAnywhere):
python3 bot.py
Head over to Telegram and start chatting... Enjoy🤗
What will you use this Telegram bot architecture for? Let me know in the comments below!!!
References
Huge kudos to Ruben Tak for the wonderful notebook I based my tutorial on.
Obviously, thanks to Langchain, Pinecone, AI21 and Telegram for making all this possible!