My own chatbot by fine-tuning GPT-2

Keisuke Sato - Feb 19 '22 - Dev Community

(Updated on 20 February 2022)

Introduction

In this post, I will fine-tune GPT-2, specifically rinna's model, one of the Japanese GPT-2 models. I am Japanese and most of my chat history is in Japanese, so I will fine-tune a "Japanese" GPT-2.

GPT-2 stands for Generative Pre-trained Transformer 2 and, as the name suggests, it generates text. We can build a chatbot by fine-tuning a pre-trained model with only a small amount of training data.

I will not go through GPT-2 in detail. I highly recommend the article How to Build an AI Text Generator: Text Generation with a GPT-2 Model on dev.to to understand what GPT-2 is and what a language model is.

git repository: chatbot_with_gpt2

I am grateful to the authors of the following two articles.

Thanks to the first author, I was able to build my chatbot model; the sources in my git repository are mostly based on his code, which I just reorganized. Thanks to the second author, I was able to understand GPT-2.

What is rinna

rinna is a conversational pre-trained model provided by rinna Co., Ltd., and as of 19 February 2022, five pre-trained models are available on Hugging Face [rinna Co., Ltd.]. rinna is somewhat famous in Japan because the company released the rinna AI on LINE, one of the most popular messaging apps in Japan. She has the persona of a junior high school girl, and anyone can chat with her on LINE.

I am not sure when the models were published on Hugging Face, but in any case, they are available now. I will fine-tune rinna/japanese-gpt2-small, which has a relatively small number of parameters. I originally wanted to use rinna/japanese-gpt-1b, which has around one billion parameters, but I couldn't because of the memory capacity on Google Colab.
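As a quick sanity check that a rinna model is usable, it can be loaded with Hugging Face transformers. Below is a minimal sketch; note that rinna's GPT-2 models use a SentencePiece-based T5Tokenizer rather than the fast GPT-2 tokenizer, which is why use_fast_tokenizer is set to False later in model_config.yaml.

# Minimal sketch: loading a rinna Japanese GPT-2 model with transformers.
from transformers import T5Tokenizer, AutoModelForCausalLM

# rinna's models use a SentencePiece-based T5Tokenizer (a "slow" tokenizer),
# matching use_fast_tokenizer: False in model_config.yaml.
tokenizer = T5Tokenizer.from_pretrained("rinna/japanese-gpt2-small")
tokenizer.do_lower_case = True  # recommended on the rinna model cards

model = AutoModelForCausalLM.from_pretrained("rinna/japanese-gpt2-small")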

Process

I will assume you have Google and GitHub accounts and can use Google Colab.

Furthermore, I will use a chat history from LINE. If you have no account on the app, that is okay: all you have to do is prepare a chat history and format the data yourself, although I know these are the hardest and most bothersome parts. If you do have an account, the following steps should work. Note that if your LINE display language is Japanese, you should change it to English before exporting the chat history, because the following steps assume the display language (not the message language) is English.

Prepare the environment

At the end of this process, your Google Drive will be structured as follows.

MyDrive ---- chatbot_with_gpt2.ipynb
           |
           |- config
           |    |- general_config.yaml
           |
           |- data
                |- chat_history.txt

  • 1: Clone the repository by running the following command in Git Bash.

git clone https://github.com/ksk0629/chatbot_with_gpt2
  • 2: Upload chatbot_with_gpt2/chatbot_with_gpt2.ipynb to your Google Drive.

  • 3: Make a directory named config on your Google Drive and create general_config.yaml in the config folder.

general_config.yaml is as follows.

github:
  username: your_github_username
  email: your_email
  token: your_access_token
ngrok:
  token: anything

The ngrok token itself is not actually used, but the block has to exist to avoid an error in a later step.

  • 4: Get a chat history from LINE.

We can export the history by following the official guide [Help centre - Chat history].

  • 5: Make a directory named data on your Google Drive and move the chat history into it.

Prepare training data and build the model

  • 1: Open chatbot_with_gpt2.ipynb in Google Colaboratory.

  • 2: Run the cells in the Preparation block.

Running these cells prepares the environment for generating the training data and building the model.

  • 3: Change chatbot_with_gpt2/pre_processor_config.yaml.

The initial yaml file is as follows.

line:
  initial:
    input_username: "input_username"
    output_username: "output_username"
    target_year_list: "[2016,2017,2018,2019,2020,2021,2022]"
  path:
    input_path: "/content/gdrive/MyDrive/data/chat_history.txt"
    output_path: "chat_history_cleaned.pk"

You have to change at least the initial block. The meaning of each line is as follows.

  • input_username: the username whose messages will be the inputs to the model
  • output_username: the username whose messages the model should learn to output
  • target_year_list: the years of messages to use for training
  • input_path: the path to the raw chat history
  • output_path: the path to the cleaned data produced by the following step

Note that if you leave output_path pointing inside the Colab runtime, your training data will not be available after the notebook is closed; of course, it is available while the notebook is running.
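For example, pointing output_path at a location on your Google Drive keeps the cleaned data across sessions (the exact path below is just an illustration):

line:
  path:
    input_path: "/content/gdrive/MyDrive/data/chat_history.txt"
    output_path: "/content/gdrive/MyDrive/data/chat_history_cleaned.pk"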

  • 4: Run the cell in the Preprocessing data block.

The data is cleaned in the cell.
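For reference, the cleaning conceptually does something like the following sketch. It assumes each exported message is a tab-separated line of time, username, and message, uses placeholder usernames and paths, and omits the target_year_list filtering for brevity; the notebook's actual code may differ.

import pickle

# Placeholder usernames for illustration; use the ones from pre_processor_config.yaml.
input_username = "input_username"
output_username = "output_username"

pairs = []         # (input message, output message) pairs for training
last_input = None  # the most recent message from input_username

with open("/content/gdrive/MyDrive/data/chat_history.txt", encoding="utf-8") as f:
    for line in f:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue  # skip date headers, blank lines, and system messages
        _, username, message = parts
        if username == input_username:
            last_input = message
        elif username == output_username and last_input is not None:
            pairs.append((last_input, message))
            last_input = None

# Save the cleaned data; the notebook stores it as a pickle (.pk) file.
with open("chat_history_cleaned.pk", "wb") as f:
    pickle.dump(pairs, f)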

  • 5: Change chatbot_with_gpt2/model_config.yaml.

The initial yaml file is as follows.

general:
  basemodel: "rinna/japanese-gpt2-xsmall"
dataset:
  input_path: "chat_history_cleaned.pk"
  output_path: "gpt2_train_data.txt"
train:
  epochs: 10
  save_steps: 10000
  save_total_limit: 3
  per_device_train_batch_size: 1
  per_device_eval_batch_size: 1
  output_dir: "model/default"
  use_fast_tokenizer: False

You have to change input_path in the dataset block to the path to the cleaned data, which is specified in pre_processor_config.yaml. You can change basemodel to rinna/japanese-gpt2-small, but the others (medium and 1b) would not work because of a lack of GPU memory, as I mentioned in the What is rinna section.

  • 6: Run the cells in the Training data preparation and Building model blocks.

That is all! After running these cells, all you have to do is wait for a while. You will see your model files in the directory specified in model_config.yaml.
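For reference, the training cells do roughly the following under the hood, based on the config fields above. This is a sketch using transformers' Trainer, not the notebook's exact code; block_size is an assumed value.

from transformers import (AutoModelForCausalLM, DataCollatorForLanguageModeling,
                          T5Tokenizer, TextDataset, Trainer, TrainingArguments)

basemodel = "rinna/japanese-gpt2-xsmall"
tokenizer = T5Tokenizer.from_pretrained(basemodel)  # use_fast_tokenizer: False
model = AutoModelForCausalLM.from_pretrained(basemodel)

# gpt2_train_data.txt is the text file built from the cleaned chat pairs.
train_dataset = TextDataset(tokenizer=tokenizer,
                            file_path="gpt2_train_data.txt",
                            block_size=128)  # assumed value
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

# These arguments mirror the train block of model_config.yaml.
training_args = TrainingArguments(
    output_dir="model/default",
    num_train_epochs=10,
    save_steps=10000,
    save_total_limit=3,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
)

trainer = Trainer(model=model,
                  args=training_args,
                  data_collator=data_collator,
                  train_dataset=train_dataset)
trainer.train()
trainer.save_model()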

Let's talk to the model

Again, all you have to do is run the single cell in the Talking with the model block. Then the code runs and you can talk with the model, like the following.

(Screenshot: a sample conversation with the fine-tuned model.)
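Under the hood, one conversation turn amounts to a generate call like the minimal sketch below (the notebook's actual cell may differ, and the sampling parameters are assumptions). Passing pad_token_id explicitly also silences the warning mentioned in the conclusion.

from transformers import AutoModelForCausalLM, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("rinna/japanese-gpt2-xsmall")
model = AutoModelForCausalLM.from_pretrained("model/default")  # the fine-tuned model

text = "おっす"  # "Hey"
input_ids = tokenizer.encode(text, return_tensors="pt")
output_ids = model.generate(input_ids,
                            max_length=64,
                            do_sample=True,
                            top_p=0.95,  # assumed sampling parameter
                            pad_token_id=tokenizer.pad_token_id)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))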

Conclusion

I fine-tuned GPT-2 with my chat history on LINE. It certainly works, but there are the following problems, as you can see in the Let's talk to the model section.

  • An unnecessary line, Setting 'pad_token_id' to 'eos_token_id':2 for open-end generation., is printed with each response.
  • Some tokens, like <br:, [<unk>hoto]<br///, and <br/ゥ>, break the coherence of the sentences.
  • The model did not reply well.

The first response

帰ったんか
おつかれさま!

looks quite good: the input "おっす" means "Hey", and the response means something like "You're home. You must be exhausted." But the other responses look wrong. To improve the model, I could clean the training data further, and I need a deeper understanding of GPT-2 and the source code.

If you have any suggestions, comments, or questions about this article, please comment below. I'd appreciate it.
