GPT (2018)
In 2018, OpenAI released the first version of GPT (Generative Pre-Trained Transformer), a model for generating text that reads as if a human had written it. The architecture of GPT is based on the original Transformer’s decoder.
They trained GPT in two stages:
- Unsupervised Pre-training: GPT is first trained on unlabeled text, which taps into abundant raw text corpora.
- Supervised Fine-tuning: the pre-trained model is then fine-tuned for each specific task using labeled data.
This article gives an overview of each stage.
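Before diving into each stage, a rough sketch may help. The following toy PyTorch code is not OpenAI's implementation; the model, the random "corpus", and the two-class task head are made up for illustration, but the shape of the two-stage pipeline is the same:

```python
import torch
import torch.nn as nn

class TinyDecoderLM(nn.Module):
    """A stand-in for GPT's Transformer decoder: embeddings plus an LM head
    (the real model has masked self-attention blocks in between)."""
    def __init__(self, vocab_size=100, d_model=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, token_ids):
        return self.lm_head(self.embed(token_ids))  # logits over the vocabulary

model = TinyDecoderLM()

# Stage 1 -- unsupervised pre-training: predict token t+1 from the tokens up
# to t on plentiful unlabeled text (random token ids stand in for a corpus).
unlabeled = torch.randint(0, 100, (8, 16))        # 8 sequences of 16 token ids
logits = model(unlabeled[:, :-1])                 # predictions for positions 1..15
pretrain_loss = nn.functional.cross_entropy(
    logits.reshape(-1, logits.size(-1)), unlabeled[:, 1:].reshape(-1)
)

# Stage 2 -- supervised fine-tuning: reuse the pre-trained body and train a
# small task head on labeled examples (a made-up 2-class task).
task_head = nn.Linear(32, 2)
labeled = torch.randint(0, 100, (4, 16))
labels = torch.randint(0, 2, (4,))
features = model.embed(labeled).mean(dim=1)       # pooled features from the pre-trained model
finetune_loss = nn.functional.cross_entropy(task_head(features), labels)
```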
Unsupervised Pre-training
The purpose of this stage is to train the model to learn the structure of the language and to capture the statistical patterns present in the text dataset. In other words, it is not aimed at a specific language task but at improving the model’s understanding of the language itself. The model learns to predict the next word in a sequence based on the context of the previous words (a.k.a. generative pre-training). It’s like a smartphone keyboard suggesting the next word as you type.
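To make the "next-word" idea concrete, here is a tiny illustration (plain Python, not OpenAI's code) of how a single unlabeled sentence yields many (context → next word) training examples; GPT itself works on sub-word tokens produced by byte-pair encoding rather than on whole words:

```python
tokens = "the model learns to predict the next word".split()

# Every prefix of the sentence becomes a context, and the word that follows
# it becomes the prediction target -- no human labeling required.
for i in range(1, len(tokens)):
    context, target = tokens[:i], tokens[i]
    print(context, "->", target)

# ['the'] -> model
# ['the', 'model'] -> learns
# ['the', 'model', 'learns'] -> to
# ...
```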
More concretely, when pre-training the model with unlabeled texts, they feed a sequence of tokens (i.e., part of a sentence) to the model (a variant of the Transformer decoder) to predict the…
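For reference, the GPT paper writes this pre-training objective as maximizing the log-likelihood of each token u_i given the k tokens before it, where u_1, …, u_n are the unlabeled tokens, k is the size of the context window, and Θ are the model parameters:

$$L_1(\mathcal{U}) = \sum_i \log P(u_i \mid u_{i-k}, \dots, u_{i-1}; \Theta)$$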