Neural Machine Translation with Attention Mechanism

How Does A Machine Translation Model Know Where To Look?

Sep 28, 2021


This article reviews the paper "Neural Machine Translation by Jointly Learning to Align and Translate" by Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio.

In 2014, machine translation using neural networks emerged. Researchers adopted encoder-decoder (or sequence-to-sequence) architectures, which encode a sentence in one language into a fixed-length vector and then decode that vector into a sentence in another language.

However, this approach requires the encoder to compress all the necessary information into a fixed-length vector, no matter how long the source sentence is, which makes it difficult for the model to handle long sentences. The performance of such an encoder-decoder model degrades sharply as the length of the input sentence increases.

The paper proposed an extension that overcomes this limitation by letting the decoder access all of the encoder's hidden states, not just the final one. Moreover, the authors introduced the attention mechanism so that the decoder can learn which parts of the source sentence to use as context when generating each word of the translation.

The approach frees the encoder from having to compress all the necessary information into a fixed-length vector. As a result, sentence length is no longer a significant issue.
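To make the idea concrete, here is a minimal sketch of how a decoder could weight every encoder hidden state and combine them into one context vector. The function names (`softmax`, `attention_context`, `dot_score`) and the toy numbers are illustrative assumptions, not taken from the paper. In particular, the paper scores each encoder state with a small feed-forward network, while this sketch substitutes a plain dot product to stay short.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot_score(decoder_state, encoder_state):
    """Assumed scoring function: a dot product (the paper uses a small network)."""
    return sum(a * b for a, b in zip(decoder_state, encoder_state))

def attention_context(decoder_state, encoder_states, score_fn=dot_score):
    """Score every encoder state, normalize, and return (weights, context)."""
    scores = [score_fn(decoder_state, h) for h in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    # The context vector is the attention-weighted sum of all encoder states.
    context = [
        sum(w * h[d] for w, h in zip(weights, encoder_states))
        for d in range(dim)
    ]
    return weights, context

# Toy example: three source positions, hidden size 2 (made-up values).
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [2.0, 0.0]
weights, context = attention_context(decoder_state, encoder_states)
print(weights)
print(context)
```

The decoder computes a fresh set of weights at every output step, so the context vector changes from word to word instead of being fixed for the whole sentence.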

This article discusses the following topics:

  • Encoder-Decoder Bottleneck
  • Attention Mechanism
  • Experimental Results

Encoder-Decoder Bottleneck

In an RNN encoder-decoder architecture, an RNN encoder processes an input sentence (a sequence of word vectors) and generates a fixed-length vector that represents it. An RNN decoder then consumes that vector to produce a translation in the target language.
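The fixed-length property can be illustrated with a small sketch. It assumes a plain tanh RNN with random, untrained weights; the names `rnn_encode`, `W_x`, `W_h`, and `random_sentence` are hypothetical, and real systems typically use gated units instead. The point is only that a 3-word and a 50-word sentence are both squeezed into a vector of the same size.

```python
import math
import random

def rnn_encode(word_vectors, W_x, W_h, hidden_size):
    """Run a plain tanh RNN over the sentence and return the final hidden state."""
    h = [0.0] * hidden_size
    for x in word_vectors:
        # h_t = tanh(W_x x_t + W_h h_{t-1}); bias omitted for brevity.
        h = [
            math.tanh(
                sum(W_x[i][k] * x[k] for k in range(len(x)))
                + sum(W_h[i][j] * h[j] for j in range(hidden_size))
            )
            for i in range(hidden_size)
        ]
    return h

random.seed(0)
EMBED, HIDDEN = 3, 4  # toy sizes chosen for illustration
W_x = [[random.uniform(-0.5, 0.5) for _ in range(EMBED)] for _ in range(HIDDEN)]
W_h = [[random.uniform(-0.5, 0.5) for _ in range(HIDDEN)] for _ in range(HIDDEN)]

def random_sentence(n_words):
    """Stand-in for a sentence: n_words random word vectors."""
    return [[random.uniform(-1, 1) for _ in range(EMBED)] for _ in range(n_words)]

short_vec = rnn_encode(random_sentence(3), W_x, W_h, HIDDEN)
long_vec = rnn_encode(random_sentence(50), W_x, W_h, HIDDEN)
print(len(short_vec), len(long_vec))  # both have length HIDDEN
```

Because the decoder only ever sees this one vector, everything it needs to know about a long sentence has to fit in the same few numbers as a short one.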

In general, an RNN encoder-decoder architecture looks like this: