Arunn's Notebook

Information Entropy

Arunn Narasimhan

1. Information and Uncertainty
Consider the case of one of the two boxes shown in Fig. 1 below, containing a ball in it. You are asked to find in which box the ball is by asking me questions. I will answer only with a ‘yes’ or a ‘no’.

Figure 1

Now, before you ask any question to me, you don't know which box contains the ball. In other words, you are uncertain and have no information about the system to decide on the correct answer of which box contains the ball. Once you ask questions and receive answers – in this case, one is enough – you gain information and are no more uncertain in making your decision. This implies that the decision that has to be made by you between the two boxes will depend on some logic and is not random guess work.

We could conclude that in view of such a decision making process, Uncertainty decrease leads to Information increase. We shall see in subsequent sections that Information and Uncertainty can be treated as quantities with numerical values and used to describe the efficiency of decision making processes, whether such processes are similar to the ball in the box example in Fig. 1 or picking out a ‘set of symbols’ from a relatively large sequence of symbols, as done in communication engineering.

2. The Real Bit
To quantify and measure Uncertainty we need to introduce the concept of the ‘real’ bit. A bit is defined as the amount of Information necessary to decide between two equal choices. As we will see later, unlike in computer parlance, the concept of a bit here can be associated with real numbers.
In the Ball and Box example in Fig. 1 with two boxes, you have to ask at least one question to decide on the correct choice of box. In other words, you require one bit of information to decide between two choices. Similarly, in the example of four boxes and a ball shown in Figure 2 below, for the presence of the ball in any of the boxes being equally likely, you require at least two bits of information. That is, you can be sure of the correct answer of which box the ball is only after at least two questions.


Figure 2

In another way, we can say that the Uncertainty of which box contains the ball is two bits per box before any question is asked by you and reduces to zero once the answer is known. By taking into account all possible eventualities, the number of bits provides the optimal number of questions that leads to the answer.

Extending the example, for 8 equally likely events of boxes containing the ball, we should require at least 3 bits of Information to make a choice (decision). Correspondingly, the Uncertainty of the system is 3 bits per box before any question is asked and reduces to zero, once the answer is known. This process of reduction in Uncertainty is accompanied by a gain in Information, that is, from zero knowledge of which box has the ball to the case of complete knowledge of which box has the ball.

Putting this into a mathematical framework, we can write

 2^b = M \cdots (1)

or upon rearrangement

 b = log_2(M) \cdots (2)

where, b is the number of bits required to determine which of the M different boxes has the ball in it. Obviously number 2 arises because of the two equally likely choices we have for that example (ball in or not in a box). If, for instance, the ball is slightly magnetized and one of the boxes is made of iron, then the outcome will be different and cannot be quantified using Eq. (2).

We shall proceed now to generalize the concepts to include such situations. We pick a new example now which is useful in communication engineering and coding theory.

Consider a sequence comprising of M different symbols and N total number of symbols. These symbols can be alphabets of a particular language (say, English) such that the sequence of letters could form a meaningful word or not. These M different symbols are attributed with different probabilities of occurrence Pi where i denotes which symbol is being referred to.

For example, the sequence baabbcac has N = 7 total number of symbols (counting each of the letters) and M = 3 total number of different symbols (a, b and c). Unlike the earlier boxes and a ball example, here the probability of occurrence for each symbol (a or b or c) in the sequence can be different.

Rearranging Eq. (2) as

 b = -log_2 \frac{1}{M} \cdots (3)

we can see that for the particular case of M equally probable symbols (or boxes in the earlier example) 1/M is the probability of occurrence of each symbol (or the ball to reside in any one particular box). Substituting 1/M as P in Eq. (3), this implies

 b = -log_2(P) \cdots (4)

For instance, in the case of Figure 2, there are 4 equally likely boxes in which the ball could have resided and so M = 4 and so P = 1/4. This leads to b = 2 from Eq. (4), which we concluded earlier. Also, it is clear from the nature of Eq. (4) that the bit b could be a real number.

3. Surprisal and Uncertainty

Whether it is the occurrence of a sequence in a particular way or the residence of a ball in a set of boxes, we know that the sum of all the probabilities is always one. This means, we cannot cheat. In the sequence given in the example in section 2, symbols that are not part of the original set of considered symbols will not appear. Analogously, in Figure 2, the ball will not reside in a fifth box that is not in our list of (four) boxes.

This seemingly obvious observation has significance in that it is used to define a concept called the surprisal of any event or symbol. It is defined as

 u_i = -log_2(P_i) \cdots (5)

Note also that Eq. (5) is similar in form to Eq. (4). If the probability of occurrence P of an event or symbol in a sequence is low, its surprisal u will be high. The occurrence of that low probable event or the appearance of that symbol in that particular location of the sequence will ‘surprise’ us. If the probability of occurrence is zero, we are totally surprised by the occurrence of that symbol, which is not in the original set of considered symbols.

Let us tie this concept of surprisal with Uncertainty (H, measured as b, bits per symbol), from our earlier discussion in section 1. We observe that Uncertainty before we receive a symbol (or before we see the box with the ball) is the same, irrespective of what symbol we receive. This is because the symbol that we don't have cannot change our Uncertainty. It cannot increase our surprisal either. Therefore, Uncertainty H can be defined as the average of surprisal for all the N existing symbols

 H = \frac{u_1 + u_2 + \cdots + u_n}{N} \cdots (6)

Using the summation notation, and accounting for the multiple appearances of any symbol, we can rewrite Eq. (6) as

 H = \Sigma _{i=1}^M \frac{N_i}{N}u_i \cdots (7)

where N_i is the number of appearances of the i th symbol.

Since F_i , the frequency of the i th symbol occurring, is found by dividing N_i , the number of times the particular symbol occurs by N, the number of symbols, we can further rewrite Eq. (7) as

 H = \Sigma _{i=1}^M  F_i u_i \cdots 8

Also, we can see that the probability of occurrence of a symbol in a sequence Pi used in Eq. (5) can also be written as

 P_i = \lim_{N \rightarrow \infty} F_i \cdots (9)

This suggests that for N -> 8, substituting Eq. (9) in Eq. 8 leads to,

 H = \Sigma_{i=1}^M  P_iu_i \cdots (10)

Substituting for ui from Eq. (5) in Eq. (10), we arrive at

 H = -\Sigma_{i=1}^M  P_ilog_2P_i \cdots (11)

Equation 11 is Shannon’s formula for Uncertainty (H), measured in bits per symbol.

Observe that the form of Eq. (11) for Uncertainty has a striking similarity to S = k log W the concept of Thermodynamic Entropy introduced by Ludwig Boltzmann. The story goes (explained in reference [1]) that Shannon, after having described his concept in mathematical form asked his friend von Neuman what to call it and von Neuman asked him to call it entropy as nobody is sure of what it is and he (Shannon) would always have the advantage in any debate over his theory. So Shannon named it Information Entropy although he also used the term Uncertainty to denote the same idea in his paper (see reference [2] for a related discussion).

Calling this Uncertainty H as Information Entropy (measured in bits per symbol) has also led to confusions because of wrong comparisons with and interpretations as the Thermodynamic Entropy (measured in Joules per Kelvin).

The nature of this H function is shown in the figure below. Notice that the curve is symmetrical, and rises to a maximum when the two symbols are equally likely (probability = 0.5). It falls towards zero whenever one of the symbols becomes dominant at the expense of the other symbol.


Figure 3

4. From Uncertainty to Information

We saw in section 1 that a decrease in Uncertainty should increase Information. Let us put this also in a mathematical framework. Let us consider the example of the sequence baabbcac

According to the explanations in section 2, in the above sequence we have N = 8; M = 3; N(a)=3; N(b)=3 and N(c)=2

This implies the frequency of occurrence of 'a' is F(a) = 0.375. Likewise, F(b) = 0.375 and F(c) = 0.25. Using Eq. (5) we can deduce the surprisal of occurrence for symbol 'a' as u(a) = 1.415. Likewise, u(b) = 1.415 and u(c) = 2.000. Substituting in Eq. (10) we arrive at the uncertainty H(after) = 1.5613 after the above sequence has happened.

On the other hand, before the sequence baabbcac is “known” or being received in a receiver as a code sequence, the Uncertainty about that sequence is, H(before) = 1.585, as calculated from Eq. (3). We use Eq. (3) as before the sequence appears, the occurrence of all of the M = 3 symbols (a, b and c) are equally likely.

Using the above calculated Uncertainty values, the Information (R) gained in the process of receiving this particular sequence baabbcac is

 \text{R = H(before) - H(after) = 1.585 - 1.5613  = 0.2 bits/symbol} \cdots (12)

The above equation is Shannon’s formula for calculating Information.

The calculation for H(before) using Eq. (3), as shown above, assumes the elements of the sequence to comprise only of a, b and c (and M = 3). If we allow the possibility of any of the entire English alphabet to appear as elements in the “to be received” sequence, then M = 26. Then H(after) = 1.5613 remains the same for the same sequence **baabbcac**, but using Eq. (3) we can deduce H(before) = 4.7.

This means Information

 \text{R = H(before) - H(after) = 3.1387 bits/symbol} \cdots (13)

Out of the large uncertainty (because of more possibilities of alphabets) of what could have been the sequence, before it arrived, once the sequence arrived, the information gained became more. In the former case of the alphabet source pool being restricted to a b and c (only 3), the sequence that could be received can be “guessed” more correctly (only 3 possibilities are there, per each element). In other words, the information gained in the reception of the sequence is less.

To revisit the boxes and a ball case, for eight boxes and a ball, before the ball is located, there is an Uncertainty of 3 bits. After the ball has been located, it becomes 0 bit. Since Uncertainty has gone from 3 to 0, we gained 3 bits of Information.

5. Conclusions

Information Entropy or Shannon Entropy is a concept that is used primarily in communication engineering and coding theory. In the last decade it has been used in bio-informatics and DNS sequencing, where protein binding is explained using information theory and sequence logos.


[1] See Myron Tribus and M. C. Mc Irvine, Energy and Information, Sci. Amer, vol. 225, 3, pp. 179-188, Sep. 1971

[2] For more clarification, see the excellent post by Dr. Thomas Schneider titled Information is not Entropy and Information is not Uncertainty

[3] Cosma Shalizi has an introduction on Information Theory where he collects and recommends many useful literature and reliable web resources on this topic

[4] some more from Cosma on Ergodic Theory and Estimating Entropies and Information

[5] Information theory: a short introduction: notes on Information Entropy (and how to use it in anthropology) from John Hawks

© Arunn Narasimhan | Original version written ~ July 19, 2009 | Last revision on Apr 01, 2012