What does KL stand for? Is it a distance measure? What does it mean to measure the similarity of two probability distributions?
If you want to understand the KL divergence intuitively, you are in the right place: I'll demystify it for you.
Since I explain the KL divergence from the information theory point of view, you should be familiar with the concepts of entropy and cross-entropy to fully follow this article. If you are not, you may want to read the following two articles first: one on entropy and the other on cross-entropy.
If you are ready, read on.
What does KL stand for?
The KL in KL divergence stands for Kullback-Leibler, named after the following two people:
They introduced the concept of the KL divergence in 1951 (Wikipedia).
What is the KL divergence?
The KL divergence tells us how well the probability distribution Q approximates the probability distribution P. It is computed as the cross-entropy of P and Q minus the entropy of P.
As a reminder, here are the formulas for the cross-entropy and the entropy:
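A minimal sketch of those formulas, assuming P and Q are discrete distributions over the same set of outcomes x (the choice of logarithm base only changes the unit, bits vs. nats):

$$H(P, Q) = -\sum_{x} P(x) \log Q(x)$$

$$H(P) = -\sum_{x} P(x) \log P(x)$$

Subtracting the entropy from the cross-entropy gives the KL divergence:

$$D_{\mathrm{KL}}(P \,\|\, Q) = H(P, Q) - H(P)$$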
The KL divergence can also be expressed in expectation form as follows:
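$$D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)} = \mathbb{E}_{x \sim P}\left[\log \frac{P(x)}{Q(x)}\right]$$

To make this concrete, here is a minimal numerical sketch in NumPy with two made-up discrete distributions p and q (the values are purely illustrative); it checks that cross-entropy minus entropy matches the direct sum:

```python
import numpy as np

# Two made-up discrete distributions over three outcomes (illustrative values only).
p = np.array([0.5, 0.3, 0.2])   # the "true" distribution P
q = np.array([0.4, 0.4, 0.2])   # the approximating distribution Q

cross_entropy = -np.sum(p * np.log(q))   # H(P, Q)
entropy = -np.sum(p * np.log(p))         # H(P)
kl_direct = np.sum(p * np.log(p / q))    # sum over x of P(x) * log(P(x) / Q(x))

print(cross_entropy - entropy)  # KL as cross-entropy minus entropy
print(kl_direct)                # same value, computed directly
```

Both print the same number (about 0.025 nats for these values), confirming that the two forms are equivalent.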