What does KL stand for? Is it a distance measure? What does it mean to measure the similarity of two probability distributions?
If you want to understand the KL divergence intuitively, you are in the right place: I’ll demystify it for you.
As I’m going to explain the KL divergence from the information-theory point of view, you need to know the concepts of entropy and cross-entropy to fully follow this article. If you are not familiar with them, you may want to read the following two articles: one on the entropy and the other on the cross-entropy.
If you are ready, read on.
What does KL stand for?
KL in the KL divergence stands for Kullback-Leibler, named after the following two people: Solomon Kullback and Richard Leibler.
They introduced the concept of the KL divergence in 1951 (Wikipedia).
What is the KL divergence?
The KL divergence tells us how well the probability distribution Q approximates the probability distribution P: it is the cross-entropy of P and Q minus the entropy of P.
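In symbols, with H(P, Q) denoting the cross-entropy and H(P) the entropy:

$$ D_{KL}(P \,\|\, Q) = H(P, Q) - H(P) $$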
As a reminder, I put the cross-entropy and the entropy formulas below:
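Here, p(x) and q(x) are the probabilities that P and Q assign to an outcome x:

$$ H(P, Q) = -\sum_{x} p(x) \log q(x) $$

$$ H(P) = -\sum_{x} p(x) \log p(x) $$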
The KL divergence can also be expressed in the expectation form as follows:
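$$ D_{KL}(P \,\|\, Q) = \mathbb{E}_{x \sim P}\!\left[ \log \frac{p(x)}{q(x)} \right] $$

This comes from subtracting the entropy from the cross-entropy: the two sums combine into a single log ratio, weighted by p(x), which is exactly an average over P.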
The expectation formula can be expressed in the discrete summation form or in the continuous integration form:
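In the discrete case, with probability mass functions p(x) and q(x):

$$ D_{KL}(P \,\|\, Q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)} $$

In the continuous case, with probability density functions p(x) and q(x):

$$ D_{KL}(P \,\|\, Q) = \int_{-\infty}^{\infty} p(x) \log \frac{p(x)}{q(x)} \, dx $$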
So, what does the KL divergence measure? It measures the similarity (or dissimilarity) between two probability distributions.
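To make the discrete formula concrete, here is a minimal NumPy sketch. The function name kl_divergence and the example distributions are just for illustration; it assumes p and q are arrays of probabilities over the same outcomes, each summing to 1, with q(x) > 0 wherever p(x) > 0.

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P || Q) = sum over x of p(x) * log(p(x) / q(x))."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0  # outcomes with p(x) = 0 contribute nothing to the sum
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = [0.36, 0.48, 0.16]   # the "true" distribution P
q = [1/3, 1/3, 1/3]      # a uniform approximation Q

print(kl_divergence(p, q))  # > 0: Q approximates P imperfectly
print(kl_divergence(q, p))  # a different value: the KL divergence is not symmetric
```

Note that swapping the arguments gives a different number, which already hints at the answer to the next question. If SciPy is available, scipy.stats.entropy(p, q) should return the same value as kl_divergence(p, q), since it computes the relative entropy when a second distribution is given.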
If so, is the KL divergence a distance measure?
To answer this question, let’s see a few more characteristics of the KL divergence.
The KL divergence is non-negative
The KL divergence is non-negative. An intuitive proof is that:
- if P = Q, the KL divergence is zero, because the cross-entropy equals the entropy (see the short derivation after this list);
- if P ≠ Q, the KL divergence is positive, because the entropy is the minimum average lossless encoding size: encoding data from P with a code optimized for any other distribution Q costs more on average, so the cross-entropy is larger than the entropy.
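For the first case, substituting Q = P into the definition makes the cross-entropy and the entropy identical, so their difference vanishes:

$$ D_{KL}(P \,\|\, P) = H(P, P) - H(P) = -\sum_{x} p(x) \log p(x) + \sum_{x} p(x) \log p(x) = 0 $$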