Introduction
Hi folks,
In today's article, we are going to solve a specific problem with OpenAI and LLMs: chatting with OpenAI over multiple PDF documents and getting accurate responses.
I had been searching for a solution to this problem for the last 15 days; I found and tried many approaches, but none of them solved it exactly. So I put together my own solution by reading multiple articles and watching videos.
Problem Statement
Create an OpenAI-based question-answering tool that can answer queries across multiple PDF documents.
Tech Stack
We are going to use Python as the programming language, along with some useful libraries:
- OpenAI
- Langchain
- FastAPI
- PyPDF2
- python-dotenv
- langchain_community
- FAISS (faiss-cpu)
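If you're starting from scratch, these can be installed with pip in one go. I'm adding uvicorn here to serve the app; the other package names are the ones the imports below assume:

pip install fastapi uvicorn python-dotenv PyPDF2 langchain langchain-community langchain-openai faiss-cpu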
Code
Create a directory for the FastAPI app, and inside it create another directory where the PDF files will be stored:
- main_dir
  - docs
  - main.py
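On Linux or macOS, a quick sketch of the setup commands (names match the layout above):

mkdir -p main_dir/docs
cd main_dir
touch main.py .env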
main.py
import os
from fastapi import FastAPI, HTTPException
from dotenv import load_dotenv
from PyPDF2 import PdfReader
from langchain.memory.buffer import ConversationBufferMemory
from langchain.chains import ConversationalRetrievalChain
from langchain.text_splitter import CharacterTextSplitter
from langchain_community.vectorstores.faiss import FAISS
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
load_dotenv()  # load variables from the .env file
os.environ['OPENAI_API_KEY'] = os.getenv("OPENAI_API_KEY")
os.environ["KMP_DUPLICATE_LIB_OK"] = "True"  # work around an OpenMP clash FAISS can trigger
app = FastAPI(debug=True, title="Bot API", version="0.0.1")
text_folder = 'docs'
# Only pick up PDF files from the docs folder
pdf_docs = [os.path.join(text_folder, fn) for fn in os.listdir(text_folder) if fn.lower().endswith(".pdf")]
def get_pdf_text(pdf_docs):
    text = ""
    for pdf in pdf_docs:
        pdf_reader = PdfReader(pdf)
        for page in pdf_reader.pages:
            # extract_text() can return None for image-only pages
            text += page.extract_text() or ""
    return text
def get_text_chunks(text):
    text_splitter = CharacterTextSplitter(
        separator="\n",
        chunk_size=1000,
        chunk_overlap=200,
        length_function=len
    )
    chunks = text_splitter.split_text(text)
    return chunks
def get_vectorstore(text_chunks):
    embeddings = OpenAIEmbeddings()
    vectorstore = FAISS.from_texts(texts=text_chunks, embedding=embeddings)
    return vectorstore
def get_qa_chain(vectorstore):
    llm = ChatOpenAI()
    memory = ConversationBufferMemory(
        memory_key='chat_history', return_messages=True)
    conversation_chain = ConversationalRetrievalChain.from_llm(
        llm=llm,
        retriever=vectorstore.as_retriever(),
        memory=memory
    )
    return conversation_chain, memory
# Build the index once at startup
text = get_pdf_text(pdf_docs)
text_chunks = get_text_chunks(text)
vectorstore = get_vectorstore(text_chunks)
qa_chain, memory = get_qa_chain(vectorstore)
@app.get("/ask-query")
async def query(query: str):
    # ConversationalRetrievalChain expects a dict with a "question" key
    resp = qa_chain.invoke({"question": query})
    # Clear the memory every few turns so the chat history doesn't grow unbounded
    if len(resp['chat_history']) >= 6:
        memory.clear()
    return {"response": resp}
Conclusion:
This is basic code that does the job as per my requirements. Please install all the requirements, create a .env
file, and add the API key so it loads into the FastAPI app.
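The .env file needs just one line, with your own key as the value (the key below is a placeholder):

OPENAI_API_KEY=sk-your-key-here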
If you still face any issues, let's discuss them in the comments section.