THE TRANSFORMER SERIES

Transformer’s Encoder-Decoder

Understanding The Model Architecture

Naoki
11 min read · Dec 12, 2021


In 2017, Vaswani et al. published a paper titled “Attention Is All You Need” at the NeurIPS conference. It introduced the original transformer architecture for machine translation, which performed better and trained faster than the RNN encoder-decoder models that were mainstream at the time.

The transformer architecture is the basis for recent well-known models like BERT and GPT-3. Researchers have already applied it to computer vision and reinforcement learning. So, understanding the transformer architecture is crucial if you want to know where machine learning is making headway.

However, the transformer architecture may look complicated to those without much background.

Figure 1 of the paper

The paper’s authors say the architecture is simple because it has no recurrence or convolutions. In other words, it uses other common concepts like an encoder-decoder architecture, word embeddings, attention mechanisms, softmax, and so on, without the complication introduced by recurrent or convolutional neural networks.

At a high level, the transformer is an encoder-decoder network, which is very easy to understand. So, this article starts with a bird's-eye view of the architecture, introduces the essential components, and gives an overview of the entire model.

Encoder-Decoder Architecture

The original transformer published in the paper is a neural machine translation model. For example, we can train it to translate an English sentence into a French sentence.

The transformer uses an encoder-decoder architecture. The encoder extracts features from an input sentence, and the decoder uses the features to produce an output sentence (translation).
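To make that flow concrete, here is a minimal sketch using PyTorch's built-in transformer modules. The vocabulary sizes, dimensions, and toy inputs are illustrative assumptions of mine (not values from the paper), and positional encoding and masking are omitted for brevity; this is not the paper's reference implementation.

```python
# Minimal encoder-decoder sketch with PyTorch (illustrative, not the paper's code).
import torch
import torch.nn as nn

d_model = 512                      # feature size used throughout the model (assumed)
src_vocab, tgt_vocab = 10000, 10000  # toy vocabulary sizes (assumed)

src_embed = nn.Embedding(src_vocab, d_model)   # English token embeddings
tgt_embed = nn.Embedding(tgt_vocab, d_model)   # French token embeddings
transformer = nn.Transformer(d_model=d_model, nhead=8,
                             num_encoder_layers=6, num_decoder_layers=6,
                             batch_first=True)
generator = nn.Linear(d_model, tgt_vocab)      # projects decoder features to vocabulary logits

# Toy batch: one source sentence of 7 tokens, one target prefix of 5 tokens.
src = torch.randint(0, src_vocab, (1, 7))
tgt = torch.randint(0, tgt_vocab, (1, 5))

# The encoder turns the input sentence into a sequence of feature vectors ("memory");
# the decoder attends to those features while producing the output sentence.
memory = transformer.encoder(src_embed(src))            # shape: (1, 7, 512)
features = transformer.decoder(tgt_embed(tgt), memory)  # shape: (1, 5, 512)
logits = generator(features)                            # shape: (1, 5, tgt_vocab)
```

In this sketch, the encoder output is computed once per input sentence, while the decoder reads it at every step as it generates the translation.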
