Neural Machine Translation

BLEU (Bi-Lingual Evaluation Understudy)

How do we evaluate a machine translation with reference sentences?

Naoki
Oct 19, 2021

The article reviews “BLEU: a Method for Automatic Evaluation of Machine Translation”.

Human evaluation of machine translation (MT) requires skilled specialists and can take months to finish. It is an expensive and lengthy process.

The paper proposes an automatic MT evaluation method that correlates well with human assessment and is inexpensive and quick. The authors call it BLEU (Bi-Lingual Evaluation Understudy).

The main idea is that the closer a machine translation is to professional human reference translations, the better it is. So, they prepared a corpus of reference translations and defined a closeness metric for making quality judgments.

We discuss the following topics:

  • Modified n-gram Precision
  • Brevity Penalty
  • The BLEU Metric
  • A Python Example using NLTK
  • Word of Caution

Modified n-gram Precision

We humans can make many plausible translations of a given source sentence. Translated sentences may vary in word choice, as there are many synonyms with subtle differences in nuance. We can also change the order of words yet convey the same meaning.

How can we distinguish between good and lousy machine translations when we have many valid answers?

A good translation shares many words and phrases with reference translations. So, we can write a program that finds n-gram matches between an MT sentence and the reference translations.

Suppose an MT translates from a German source sentence to the following English sentence:

MT output 1: A cat sat on the mat.

Also, suppose we have the following reference translations (I’m taking the example references from the paper):

Reference 1: The cat is on the mat.
Reference 2: There is a cat on the mat.

MT output 1 matches “A cat” (a bigram) with reference 2 (we are ignoring case). It…
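Before getting to the formal definition, here is a minimal Python sketch of the n-gram matching just described. It is not the paper’s code; the function names and the lower-cased, whitespace tokenization are my own assumptions for illustration. Each candidate n-gram is counted at most as many times as it appears in any single reference, which is the “clipping” that makes the precision “modified”.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, references, n):
    """Clipped n-gram precision: each candidate n-gram counts at most as
    often as it appears in any single reference translation."""
    cand_counts = Counter(ngrams(candidate, n))

    # For every n-gram, take the maximum count over all references.
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)

    # Clip each candidate count by the reference maximum, then divide by
    # the total number of candidate n-grams.
    clipped = {g: min(c, max_ref_counts[g]) for g, c in cand_counts.items()}
    return sum(clipped.values()) / max(sum(cand_counts.values()), 1)


# Lower-cased tokens from the example above.
candidate = "a cat sat on the mat".split()
references = [
    "the cat is on the mat".split(),
    "there is a cat on the mat".split(),
]

print(modified_precision(candidate, references, 1))  # 5/6: only "sat" is unmatched
print(modified_precision(candidate, references, 2))  # 3/5: "a cat", "on the", "the mat" match
```

On this example the sketch gives a unigram precision of 5/6 and a bigram precision of 3/5, since “sat”, “cat sat”, and “sat on” appear in no reference.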
