KL Divergence Demystified

Naoki
Nov 5, 2018

What does KL stand for? Is it a distance measure? What does it mean to measure the similarity of two probability distributions?

Source: https://www.freeimages.com/photo/blue-strings-4-1542735

If you want to intuitively understand what the KL divergence is, you are in the right place: I’ll demystify the KL divergence for you.

Since I explain the KL divergence from the information theory point of view, you should be familiar with the concepts of entropy and cross-entropy to fully follow this article. If you are not, you may want to read the following two articles first: one on entropy and the other on cross-entropy.

If you are ready, read on.

What does KL stand for?

KL in the KL divergence stands for Kullback–Leibler, named after Solomon Kullback and Richard Leibler.

They introduced the concept of the KL divergence in 1951 (Wikipedia).

What is the KL divergence?

The KL divergence tells us how well the probability distribution Q approximates the probability distribution P: it is the cross-entropy of P and Q minus the entropy of P.

D_KL(P ‖ Q) = H(P, Q) − H(P)

As a reminder, here are the cross-entropy and entropy formulas:

H(P, Q) = − Σₓ p(x) log q(x)

H(P) = − Σₓ p(x) log p(x)

Subtracting the entropy from the cross-entropy gives the summation form of the KL divergence:

D_KL(P ‖ Q) = Σₓ p(x) ( log p(x) − log q(x) ) = Σₓ p(x) log( p(x) / q(x) )
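To make this concrete, here is a minimal numerical sketch in Python (NumPy). The two discrete distributions p and q are made up for illustration; the point is only that cross-entropy minus entropy gives the same number as the direct summation formula.

```python
import numpy as np

# Two made-up discrete distributions over the same four outcomes
# (illustrative only; both sum to 1).
p = np.array([0.36, 0.48, 0.06, 0.10])
q = np.array([0.30, 0.40, 0.10, 0.20])

def entropy(p):
    # H(P) = -sum_x p(x) log p(x), in nats
    return -np.sum(p * np.log(p))

def cross_entropy(p, q):
    # H(P, Q) = -sum_x p(x) log q(x), in nats
    return -np.sum(p * np.log(q))

kl = cross_entropy(p, q) - entropy(p)     # D_KL(P || Q) = H(P, Q) - H(P)
kl_direct = np.sum(p * np.log(p / q))     # same value via the summation formula
print(kl, kl_direct)                      # both print the same number
```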

The KL divergence can also be expressed in the expectation form:

D_KL(P ‖ Q) = E_{x∼P}[ log p(x) − log q(x) ] = E_{x∼P}[ log( p(x) / q(x) ) ]

where the expectation is taken over samples x drawn from P.
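As a sketch of what the expectation form buys you (again using the made-up p and q from the previous example), you can estimate the KL divergence by sampling from P and averaging the log-ratio; the Monte Carlo estimate approaches the exact sum as the number of samples grows.

```python
import numpy as np

rng = np.random.default_rng(0)

# Same made-up distributions as in the previous sketch.
p = np.array([0.36, 0.48, 0.06, 0.10])
q = np.array([0.30, 0.40, 0.10, 0.20])

# Exact value from the summation form.
exact = np.sum(p * np.log(p / q))

# Monte Carlo estimate of E_{x~P}[log p(x) - log q(x)]:
# draw outcome indices from P and average the log-ratio.
samples = rng.choice(len(p), size=100_000, p=p)
estimate = np.mean(np.log(p[samples]) - np.log(q[samples]))

print(exact, estimate)
```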
