What is it? Is there any relation to the entropy concept? Why is it used for classification loss? What about the binary cross-entropy?
Some of us have used cross-entropy to calculate classification losses and wondered why we use the natural logarithm. Some might have seen binary cross-entropy and wondered whether or not it is fundamentally different from cross-entropy. If either sounds familiar, reading this article should help demystify those questions.
The word “cross-entropy” contains both “cross” and “entropy”, and understanding the “entropy” part first makes the “cross” part easier to grasp.
So, let’s review the entropy formula.
Review of Entropy Formula
My article Entropy Demystified should help you understand the entropy concept if you are not already familiar with it.
Claude Shannon (https://en.wikipedia.org/wiki/Claude_Shannon) defined entropy to quantify the minimum encoding size while looking for a way to send messages efficiently without losing any information.
As we will see below, there are various ways of expressing the entropy.
The entropy of a probability distribution is as follows:
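In symbols (writing H(P) for the entropy and log for the natural logarithm, as the article uses elsewhere):

H(P) = -\sum_{i} P(i) \log P(i)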
We assume that we know the probability P(i) for each i. The index i indicates a discrete event, which could mean different things depending on the scenario you are dealing with.
For continuous variables, it can be written using the integral form:
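With the same conventions:

H(P) = -\int P(x) \log P(x) \, dx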
Here, x is a continuous variable, and P(x) is the probability density function.
In both the discrete and continuous cases, we are calculating the expectation (average) of the negative log probability, which is the theoretical minimum encoding size of the information from the event x.
So, the above formula can be re-written in the expectation form as follows:
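H(P) = \mathbb{E}_{x \sim P}\left[-\log P(x)\right]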
The notation x ~ P means that we calculate the expectation over the probability distribution P.
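As a quick sanity check, here is a minimal NumPy sketch (not part of the original derivation) that computes the discrete entropy in nats; the function name entropy and the example distributions are just for illustration:

import numpy as np

def entropy(p):
    # Entropy of a discrete distribution p, in nats (natural log).
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                   # treat 0 * log(0) as 0
    return -np.sum(p * np.log(p))  # E_{x~P}[-log P(x)]

print(entropy([0.5, 0.5]))  # fair coin: log(2) ≈ 0.693 nats
print(entropy([0.9, 0.1]))  # biased coin: ≈ 0.325 nats

The more uniform the distribution, the larger the entropy, which matches the intuition that less predictable events require a larger encoding size on average.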