To train a GPT-2 neural network, we first need to pre-process the data to obtain a single .txt file with a machine-learning-compatible structure.
2.1 Google Colab
For the sake of simplicity, and since the machine learning model we will use requires a GPU to work, we’re going to use Google Colab for the next steps.
If you don’t know what Google Colab is, check this other article here.
2.2 Start the notebook
Open this Colab notebook and follow these steps:
- Run the first block of cells, under the “0️⃣ Init” chapter
- Press “Run Anyway” on the pop-up
- Make sure that the output of the first command, !nvidia-smi, shows that a GPU is connected (a P100 is suggested)
- If no GPU is connected, go to Runtime > Change Runtime type > Hardware accelerator > GPU
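If you prefer to check the GPU from Python rather than reading the !nvidia-smi table by eye, the following is a minimal sketch. It is not part of the notebook: the helper name `connected_gpus` is made up for illustration, and it only assumes that the `nvidia-smi` tool is on the PATH when a GPU runtime is attached.

```python
import shutil
import subprocess


def connected_gpus():
    """Return a list of GPU names reported by nvidia-smi, or [] if none are visible.

    Hypothetical helper for illustration; not part of the Colab notebook.
    """
    # On a CPU-only runtime the nvidia-smi binary is usually missing.
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True,
        text=True,
    )
    # A non-zero exit code means the driver could not talk to a GPU.
    if result.returncode != 0:
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


gpus = connected_gpus()
if gpus:
    print("GPU connected:", ", ".join(gpus))
else:
    print("No GPU found: go to Runtime > Change Runtime type > Hardware accelerator > GPU")
```

An empty list means the runtime has no GPU attached, which is the case the last step above addresses.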
2.3 Load the data
To work with the data, we need to upload it to Colab, into the right folders.
Select all your .txt files and upload everything into the following notebook folder:
Get the file telegram_dump.json and upload it into the following notebook folder:
2.4 Parse the data
Now, run all the cells up until the block “2️⃣ Parse the data”.
Here we need to set the variable “whatsapp_user_name” to your WhatsApp name, referred to as <YourName> in chapter 1.1.
You can also change the date format used for parsing if some of the exported data uses a different format because of local date and time conventions.
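To see why the date format matters, here is a minimal sketch that parses one exported chat line with Python's `datetime.strptime`. It is not the notebook's actual code: the `parse_line` helper, the two format strings and the sample lines are assumptions made for illustration, and your own export may look different.

```python
from datetime import datetime

# Assumed example formats; check a line of your own export before choosing one.
US_FORMAT = "%m/%d/%y, %I:%M %p"   # e.g. 12/31/20, 10:15 PM
EU_FORMAT = "%d/%m/%y, %H:%M"      # e.g. 31/12/20, 22:15


def parse_line(line, date_format):
    """Split an assumed 'date - name: text' line into (datetime, name, text).

    Returns None when the line does not match the given date format.
    Hypothetical helper for illustration only.
    """
    stamp, sep, rest = line.partition(" - ")
    if not sep:
        return None
    try:
        when = datetime.strptime(stamp, date_format)
    except ValueError:
        # The timestamp does not fit this format (wrong locale, or a continuation line).
        return None
    author, sep, text = rest.partition(": ")
    if not sep:
        return None
    return when, author, text


us = parse_line("12/31/20, 10:15 PM - Bob: Happy new year!", US_FORMAT)
eu = parse_line("31/12/20, 22:15 - Bob: Happy new year!", EU_FORMAT)
wrong = parse_line("31/12/20, 22:15 - Bob: Happy new year!", US_FORMAT)
```

With the matching format, both sample lines yield the same timestamp, author and text, while the mismatched combination yields `None`. That is the kind of failure a wrong date format setting would cause during parsing.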
So, for example, if my name is “Bob” and I’m from America, the code I should use is the following: