
Large-Scale Pre-Trained Language Models

BERT

How and Why Does It Use The Transformer Architecture?

7 min read · Feb 6, 2022

BERT stands for Bidirectional Encoder Representations from Transformers. As the name suggests, it generates representations using the encoder from Vaswani et al.’s Transformer architecture. However, there are notable differences between BERT and the original Transformer, especially in how the models are trained.

This article discusses the following:

Why unsupervised pre-training?
Masked Language Model (MLM)
Next Sentence Prediction (NSP)
Supervised fine-tuning

Why Unsupervised Pre-Training?

Vaswani et al. trained the original Transformer models with supervised learning for language translation tasks, which requires pairs of source and target language sentences. For example, a German-to-English translation model needs a training dataset with many German sentences and their corresponding English translations. Collecting such data takes a lot of work, but we need it to ensure machine translation quality. There is not much else we can do about it, or is there?

Actually, we can use unsupervised learning to tap into the many unlabelled corpora available. However, before discussing unsupervised learning, let’s look at another problem with supervised representation learning.
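Before turning to that problem, here is a minimal sketch of the contrast so far: supervised translation needs a human-written target for every source sentence, whereas unlabelled text can supply its own training signal. The example sentences, the `[MASK]` placeholder, and the helper `make_masked_example` are illustrative assumptions. Hiding one random word per sentence is a deliberately simplified stand-in, not BERT's actual pre-training procedure.

```python
import random

# Supervised translation data: every source sentence needs a human-written target.
parallel_pairs = [
    ("Das ist ein Buch.", "This is a book."),
    ("Ich lerne Deutsch.", "I am learning German."),
]

# Unlabelled text: plain sentences with no annotation at all.
unlabelled_corpus = [
    "the cat sat on the mat",
    "transformers learn representations from text",
]

MASK = "[MASK]"  # hypothetical placeholder token for this sketch


def make_masked_example(sentence, rng):
    """Hide one randomly chosen word; the hidden word becomes the label."""
    tokens = sentence.split()
    position = rng.randrange(len(tokens))
    label = tokens[position]
    masked = tokens[:position] + [MASK] + tokens[position + 1:]
    return " ".join(masked), position, label


rng = random.Random(0)  # fixed seed so the sketch is repeatable
examples = [make_masked_example(s, rng) for s in unlabelled_corpus]

for masked, position, label in examples:
    print(f"{masked!r} -> predict {label!r} at position {position}")
```

Each unlabelled sentence yields an input and a target without anyone annotating it, which is why large raw corpora become usable as training data.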

The original Transformer architecture has an encoder for the source language and a decoder for the target language. The encoder learns task-specific representations that help the decoder perform the translation, e.g., from German sentences to English. It sounds reasonable that the model learns representations helpful for the ultimate objective. But there is a catch.

If we wanted the model to perform other tasks, such as question answering or language inference, we would need to modify its architecture and re-train it from scratch. Doing so is time-consuming, especially with a large corpus.

Do human brains learn a different representation for each specific task? It does not seem so. When kids learn a language, they do not have a single task in mind. They would somehow understand…

Written by Naoki
