Fine-Tune GPT-3 on custom datasets with just 10 lines of code using GPT-Index

Dhanush Reddy - Feb 12 '23 - Dev Community

The Generative Pre-trained Transformer 3 (GPT-3) model by OpenAI is a state-of-the-art language model that has been trained on a massive amount of text data. GPT-3 can generate human-like text and perform tasks like question answering, summarization, and even writing creative fiction. Wouldn't it be cool if you could feed GPT-3 your own data source and ask it questions?

In this blog post, we'll see exactly that: fine-tuning GPT-3 on custom datasets using GPT-Index, all with just 10 lines of code! GPT-Index does the heavy lifting by providing a high-level API for connecting external knowledge bases with LLMs.

Prerequisites

  • You need to have Python installed on your system.
  • An OpenAI API key. If you do not have a key, create a new account on openai.com/api and get $18 of free credits.
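
A quick note on the key: the snippets later in this post hardcode it via os.environ for simplicity. If you would rather keep the key out of your source files, you can export OPENAI_API_KEY in your shell before running the scripts and read it back in Python. A minimal sketch (assuming the variable has already been exported):

import os

# Assumes you have already run: export OPENAI_API_KEY="sk-..."
api_key = os.environ.get("OPENAI_API_KEY")
if api_key is None:
    raise RuntimeError("Please set the OPENAI_API_KEY environment variable first")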

Code

I am not going into the details of how all of this works, as that would make this blog post longer and go against the title. You can refer to gpt-index.readthedocs.io/en/latest if you need to learn more.

  • Create a folder and open it up in your favorite code editor. Create a virtual environment for this project if needed.

  • For this tutorial, we need to have gpt-index and Langchain installed. Please install the versions I mention here to avoid any breaking changes.

pip install gpt-index==0.4.1 langchain==0.0.83

If your data sources are in the form of PDFs, also install PyPDF2:

pip install PyPDF2==3.0.1

Now create a new file main.py and add the following code:

import os
os.environ["OPENAI_API_KEY"] = 'YOUR_OPENAI_API_KEY'

from gpt_index import GPTSimpleVectorIndex, SimpleDirectoryReader

# read every file inside the data/ directory
documents = SimpleDirectoryReader('data').load_data()

# build a vector index over the documents (this calls the OpenAI API)
index = GPTSimpleVectorIndex(documents)

# save the index to disk so we can reuse it later
index.save_to_disk('index.json')

For this code to run, you need to have your data sources, be they PDFs, text files, etc., inside a directory named data in the same folder. Run the code after adding your data.

Your project directory should look something like this:

project/
├─ data/
│  ├─ data1.pdf
├─ query.py
├─ main.py
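
As a side note, if the data directory is missing or empty, the loader will simply error out. If you want a friendlier message, a small guard at the top of main.py (a sketch using only the standard library) could look like this:

import os, sys

# Sanity check before building the index:
# make sure the data/ directory exists and contains at least one file.
if not os.path.isdir('data') or not os.listdir('data'):
    sys.exit("Please create a data/ directory and add your PDFs or text files to it.")
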
  • Now create another file named query.py and add the following code:
import os
os.environ["OPENAI_API_KEY"] = 'YOUR_OPENAI_API_KEY'

from gpt_index import GPTSimpleVectorIndex

# load the previously built index from disk
index = GPTSimpleVectorIndex.load_from_disk('index.json')

# ask a question against your own data and print the answer
print(index.query("Any Query You have in your datasets"))

If you run this code, you will get a response from OpenAI answering the query based on your own data.
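
Since the index is loaded from disk only once, you could also turn query.py into a small interactive loop and ask several questions in one session. A minimal sketch, reusing the same index object from above:

# keep asking questions until the user types "exit"
while True:
    question = input("Ask something about your data (or 'exit'): ")
    if question.strip().lower() == "exit":
        break
    print(index.query(question))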

I tried using this paper on arXiv as a data source and asked the following query:

An example query to GPT-3 with the corresponding response

Conclusion

With GPT-Index, it has become much easier to work with GPT-3 and fine-tune it with just a few lines of code. I hope this small post has shown you how to get started with GPT-3 on custom datasets using GPT-Index.

Of course, you can set up a simple frontend to give it a chatbot look, like ChatGPT.
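
For example, here is a rough sketch of how such a frontend backend could be wired up with Flask. This is just an illustration, not part of the original tutorial; it assumes you have run pip install flask, that index.json already exists, and that OPENAI_API_KEY is set in your environment:

from flask import Flask, request, jsonify
from gpt_index import GPTSimpleVectorIndex

app = Flask(__name__)

# load the index once at startup (remember to set OPENAI_API_KEY first)
index = GPTSimpleVectorIndex.load_from_disk('index.json')

@app.route("/ask", methods=["POST"])
def ask():
    # expects a JSON body like {"question": "..."}
    question = request.get_json().get("question", "")
    response = index.query(question)
    return jsonify({"answer": str(response)})

if __name__ == "__main__":
    app.run(debug=True)

A simple HTML or JavaScript chat page can then POST the user's message to /ask and display the returned answer.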

If you still have any questions regarding this post or want to discuss something with me, feel free to connect on LinkedIn or Twitter.

If you run an organization and want me to write for you, please connect with me on my Socials 🙃
