Understanding Entropy and Cross-entropy

Information theory

Pierre Portal
Apr 20, 2019

Entropy

(Physics): The measure of chaos or disorder in a system.
The lower the order, the higher the entropy.

(Information theory): Measure of information in terms of uncertainty.
The higher the uncertainty, the higher the entropy. The higher the entropy, the more information the system contains.

Measure of information

To understand what information and entropy are, let's start with an example: we flip a coin and we want to know which side it landed on. What is the amount of information? Or, how many questions do we have to ask before we know the state of the system? The more questions we need to ask, the more states the system can possibly be in, so the more uncertainty we have. If the coin has two heads, we know it lands on heads. We don't need to ask any question, and therefore we have zero uncertainty about the outcome. So this system has 0 entropy and contains 0 information.
If the coin is fair, it has two different sides, and we need to ask at least one question to know the state of the system: "Is it tails?" or "Is it heads?" So the entropy of this system is 1 bit.

Let's define an event X which has M different outcomes (M = 2 in the case of a fair coin flip). The amount of information received from X is I(X):

I(X) = log(M)

The base of the logarithm can vary; for this example we will use the binary log, base 2, so that we measure the information in bits. In machine learning we will more frequently use the natural log, base e.

In the case where the coin always lands on heads, M = 1:

I(X) = log2(1) = 0

In the case of a die roll, M = 6:

I(X) = log2(6) ≈ 2.585
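To make these numbers concrete, here is a minimal Python sketch (the function name is my own, not from the article) that computes I(X) = log2(M) for the three cases above:

```python
import math

def information(m: int) -> float:
    """Information (in bits) of an event with m equally likely outcomes."""
    return math.log2(m)

print(information(1))  # two-headed coin -> 0.0 bits
print(information(2))  # fair coin       -> 1.0 bit
print(information(6))  # fair die        -> ~2.585 bits
```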

When the possible outcomes A = {a1, a2, …, am} occur with probabilities P = {p1, p2, …, pm}, the information received from a single outcome ai is:

I(ai) = log2(1 / pi) = -log2(pi)

For example, if we have a set of possible outcomes A = {a1, a2} and a set of probabilities P = {0.75, 0.25}, then I(a1) = -log2(0.75) ≈ 0.415 and I(a2) = -log2(0.25) = 2.
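The same example can be checked with a short Python sketch (again, the helper name is my own):

```python
import math

def self_information(p: float) -> float:
    """Information (in bits) carried by a single outcome of probability p."""
    return -math.log2(p)

print(round(self_information(0.75), 3))  # I(a1) ≈ 0.415 bits
print(self_information(0.25))            # I(a2) = 2.0 bits
```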

Measure of entropy

Entropy is the overall uncertainty of an information source when its outcomes are not equally likely: the expected information of the outcomes, i.e. each outcome's information weighted by its probability.

H(X) = Σ pi * log2(1 / pi) = -Σ pi * log2(pi)
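As a rough sketch (the function name is mine), the entropy of a distribution can be computed like this; note how the biased coin from the previous example carries less than 1 bit on average:

```python
import math

def entropy(probs) -> float:
    """Shannon entropy (in bits) of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))    # fair coin   -> 1.0 bit
print(entropy([0.75, 0.25]))  # biased coin -> ~0.811 bits
```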

Measure of cross-entropy

When we have a true probability distribution p and a predicted distribution q, we measure the cross-entropy:

H(p, q) = -Σ pi * log2(qi)
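Here is a minimal sketch of this formula in Python (the distributions below are just an illustration, not from the article):

```python
import math

def cross_entropy(p, q) -> float:
    """Cross-entropy H(p, q) in bits between a true distribution p
    and a predicted distribution q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.75, 0.25]  # true distribution
q = [0.50, 0.50]  # predicted distribution
print(cross_entropy(p, q))  # 1.0 bit, larger than H(p) ≈ 0.811
```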

KL Divergence

If q = p, the cross-entropy is equal to the entropy. But if q != p, the cross-entropy will be greater than the entropy, and the difference between them is called the relative entropy, or KL divergence: D_KL(p || q) = H(p, q) - H(p).
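Continuing the same illustrative example, the KL divergence is exactly the gap between the cross-entropy and the entropy:

```python
import math

def kl_divergence(p, q) -> float:
    """Relative entropy D_KL(p || q) in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.75, 0.25]
q = [0.50, 0.50]
print(kl_divergence(p, q))  # ≈ 0.189 bits = H(p, q) - H(p) = 1.0 - 0.811
```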

In machine learning, we can use the cross-entropy between the two distributions p and q as a cost function when evaluating or training a classifier. This is the cross-entropy loss (or log loss). This is where we mostly use the natural log.
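As a hedged illustration of what this looks like for a single training example (the function and values are mine, not a reference implementation), the loss reduces to the negative natural log of the probability the classifier assigned to the true class:

```python
import math

def cross_entropy_loss(true_class: int, predicted_probs) -> float:
    """Log loss for one example: -ln of the probability predicted
    for the true class (natural log, as commonly used in ML)."""
    return -math.log(predicted_probs[true_class])

# A confident, correct prediction gives a small loss...
print(cross_entropy_loss(0, [0.9, 0.1]))  # ≈ 0.105
# ...while a confident but wrong prediction is penalized heavily.
print(cross_entropy_loss(1, [0.9, 0.1]))  # ≈ 2.303
```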

George Frederic Watts — Chaos c.1875–82 — Oil paint on canvas
