Entropy: An Introduction

As the first digital computers were being invented in the 1940s and researchers like Alan Turing were formalizing their capabilities, the scientific and engineering communities needed a way to quantify previous intuitive notions of information and communication. In 1948 Claude Shannon, during a stint at Bell Labs (also the birthplace of the transistor and Unix operating systems, among other major achievements in computer science) provided the first comprehensive framework to do so in his landmark paper A Mathematical Theory of Communication. With the tools provided by this theory (now known as Information Theory), we can answer important questions about the limits of modern computers like: How can we store information (say, a file on our hard drive) in the most efficient way? Or, how do we successfully communicate information (say, a text message) over a channel (e.g., our cell network) that may not be 100% reliable?

We’ve provided a couple examples of “information” in the abstract sense, but how can we define information in a formal, precise way? We’ll start with an example. Let’s say the city of Seattle has decided to introduce a new service, whereby every morning each resident gets a text message with the day’s weather forecast. If it’s wintertime, where the weather is overcast ~100% of the time, sending the text is redundant; a resident will always be correct if they have basic knowledge of the city’s weather patterns and assume that every day will be overcast. In other words, the forecast won’t provide any new information for city residents, since there’s no uncertainty in the outcome.

Now, suppose instead we’re in a different season with multiple possibilities for the day’s weather. In this case our service will actually be useful - residents no longer can just assume that the weather will be overcast every day and be correct 100% of the time. Just how useful is a given forecast? Suppose we’re in the summer, and the weather (optimistically) is sunny 75% of the time and overcast 25% of the time. In this case, a resident would be correct on most days if they naively assumed the day’s weather would be sunny, and a forecast of a sunny day thus doesn’t provide much new information as compared to the naive baseline. Correspondingly, due to their relative rarity, a forecast indicating a cloudy day would be much more informative, as the naive prediction would be incorrect in this case.

From the previous example we see that the information transmitted when communicating the outcome of an event is intimately related to frequency of that outcome. With this idea in mind, we introduce our first concept in information theory, self-information.

Let’s say we have some discrete random variable Xp(x)X \sim p(x) taking on a set of potential values xXx \in \mathcal{X} with a corresponding probability mass function p(x)p(x). We then define

The self-information (sometimes known as "surprisal") of an outcome xXx \in \mathcal{X} is defined as I(x)=log(1p(x))=logp(x) I(x) = \log\bigg(\frac{1}{p(x)}\bigg) = -\log p(x)

We usually take the log\log in the above definition to be base 2, so self-information is measured in bits. At a first glance our definition of I(x)I(x) may seem arbitrary, but we can see that it captures the ideas we had in our thought experiment earlier, along with some other nice properties. As p(x)1p(x) \to 1, we have that I(x)0I(x) \to 0. This aligns with our intuition that information about an event is only transferred from a source to a recipient if the recipient was not already aware of that event’s outcome. Similarly, as p(x)0p(x) \to 0, we have that I(x)I(x) \to \infty, reflecting that the communication of a rare outcome is more informative than a common one.

Finally, suppose we wish to communicate the outcomes of two independent events X1,X2i.i.d.p(x)X_1, X_2 \stackrel{\text{i.i.d.}}{\sim} p(x). Since the outcomes of X1X_1 and X2X_2 are independent of each other, we’d expect that a compound message containing both outcomes would convey an amount of information equal to the sum of sending two messages containing each outcome individually. The self-information satisfies this property too, since for some joint outcome (x1,x2)X×X(x_1, x_2) \in \mathcal{X} \times \mathcal{X} we have

I(x1,x2)=logp(x1,x2)=log(p(x1)p(x2))=[logp(x1)+logp(x2)]=I(x1)+I(x2) \begin{aligned} I(x_1, x_2) &=-\log p(x_1, x_2)\\ &= -\log\big(p(x_1)p(x_2)\big)\\ &= -[\log p(x_1) + \log p(x_2)]\\ &= I(x_1) + I(x_2) \end{aligned}

which matches from what we’d expect of a measure of information.

Now, suppose the outcome isn’t known ahead of time, but we still want a sense of how much information we’ll be transmitting. One simple way to do this is to weight the self-information I(x)I(x) of each outcome by the probability of that outcome p(x)p(x). This gives us our second major concept in information theory, entropy

The entropy H(X)H(X) of a random variable XX is defined as H(X)=xXp(x)I(x)=xXp(x)logp(x) H(X) = \sum_{x \in \mathcal{X}}p(x)I(x) = -\sum_{x \in \mathcal{X}}p(x)\log p(x)

By convention we assign a value of 00 to p(x)logp(x)p(x)\log p(x) when x=0x = 0, even though technically 0log00\log 0 is undefined. We can justify this by continuity, as it can be shown p(x)logp(x)0p(x) \log p(x) \to 0 as p(x)0+p(x) \to 0^{+}.1 This too aligns with our intuition - an outcome with no chance of occuring doesn’t have any effect on the variability of our event XX.

Entropy is quite possibly the “fundamental unit” of information theory, and it’ll continue coming up in all kinds of interesting ways. We can think of H(X)H(X) as quantifying how much we’ll be “surprised” by the outcome of XX on average. An example will help illustrate this idea

Let XBern(p)X \sim Bern(p). In other words, we have X={1with probability p0with probability 1p X = \begin{cases} 1 & \text{with probability $p$} \\ 0 & \text{with probability $1-p$} \end{cases}

Plugging in to our formula for entropy, we have

H(X)=plogp(1p)log(1p) H(X) = -p\log p - (1 - p)\log(1-p)

To build up our intuition we can visualize our entropy for different values of pp in the following graph (where we let H(p)H(p) denote H(X)H(X) as a function of pp)

Figure 1. Entropy of XBern(p)X \sim Bern(p) as a function of pp (mouse over to see precise values). Note that H(p)H(p) is maximized when p=1/2p = 1/2 (maximum variability in the outcome) and minimized when p=0p = 0 or p=1p = 1 (no variability in the outcome).

As we would expect, H(X)H(X) is 0 when XX is deterministic (p=0p = 0 or p=1p = 1) as there’s no potential for surprise in such cases. On the other hand, it’s maximized when p=1/2p = 1/2 when the variability in outcomes is greatest. We’ll work through another more complicated example before moving on.

Let XGeom(p)X \sim Geom(p). In other words, the probability that we need kk Bernoulli trials before reaching our first 1 is (letting q=1pq = 1 - p) P(X=k)=qk1p P(X = k) = q^{k-1}p

Plugging into our formula for entropy we have

H(X)=k=1qk1plog(qk1p)=[k=1pqk1logp+k=1(k1)pqk1logq]=[k=0pqklogp+k=0kpqklogq]=[plogpk=0qk+plogqk=0kqk] \begin{aligned} H(X) &= -\sum_{k=1}^{\infty}q^{k-1}p\cdot \log\big(q^{k-1}p\big) \\ &= -\bigg[\sum_{k=1}^{\infty}pq^{k-1}\log p + \sum_{k=1}^{\infty}(k-1)pq^{k-1}\log q\bigg]\\ &= -\bigg[\sum_{k=0}^{\infty}pq^{k}\log p + \sum_{k=0}^{\infty}kpq^{k}\log q\bigg] \\ &= -\bigg[p\log p\sum_{k=0}^{\infty}q^{k} + p\log q\sum_{k=0}^{\infty}kq^{k}\bigg] \end{aligned}

From calculus, we know that n=0rn=11r\sum_{n=0}^{\infty}r^{n} = \frac{1}{1-r} and n=0nrn=r(1r)2\sum_{n=0}^{\infty}nr^{n} = \frac{r}{(1-r)^2} for any real number rr with r<1|r| < 1. This gives us

H(X)=[plogpp+pqlogqp2]=[plogp+qlogqp]=[plogp+(1p)log(1p)p] \begin{aligned} H(X) &= -\bigg[\frac{p\log p}{p} + \frac{pq\log q}{p^2}\bigg] \\ &= -\bigg[\frac{p\log p + q\log q}{p}\bigg]\\ &= -\bigg[\frac{p\log p + (1-p)\log (1-p)}{p}\bigg]\\ \end{aligned}

If we let YBern(p)Y \sim Bern(p), this further simplifies to

H(X)=H(Y)p \begin{aligned} H(X) &= \frac{H(Y)}{p} \end{aligned}

and we visualize our entropy for for different values of pp below:

Figure 2. Entropy of XGeom(p)X \sim Geom(p) as a function of pp. Note that H(p)0H(p) \to 0 as p1p \to 1 and H(p)H(p) \to \infty as p0p \to 0. This should agree with our intuition; with a high chance of success we'd expect our process to end quickly. On the other hand, with a lower chance of success we have a much fatter tail with a higher variability of outcomes.

As we might expect, H(p)0H(p) \to 0 as p1p \to 1 since there’s less variability in the number of trials we need until our first success. On the other hand H(p)H(p) \to \infty as p0p \to 0, which reflects the fact that it’s much harder to predict the number of trials required to reach a success as the probability of success is lowered.

One immediate consequence of the definition of entropy is the following:

H(X)0 H(X) \geq 0


Going back to our definition of entropy:

H(X)=xXp(x)logp(x) H(X) = -\sum_{x \in \mathcal{X}}p(x)\log p(x)

Since our p(x)p(x) terms are probabilities, we must have 0p(x)10 \leq p(x) \leq 1. Since logx0\log x \leq 0 for 0<x10 < x \leq 1, so we must have xXp(x)logp(x)0\sum_{x \in \mathcal{X}}p(x)\log p(x) \leq 0. Multiplying by the negative sign in the definition of H(X)H(X) then completes the proof. \square

This fact will prove very useful as we proceed in the course, so it’s good to commit to memory now.

The reader may have noticed that our definition of entropy is nearly identical to that of the expected value of a random variable, differing only in the presence of the log\log function. This begs the question - can we reconcile these two concepts? First, as a reminder we define expected value as

The expected value E[X]E[X] of a discrete random variable Xp(x)X \sim p(x) is E[X]=xXp(x)x E[X] = \sum_{x \in \mathcal{X}} p(x)\cdot x

Now, suppose we apply some deterministic function gg to our random variable XX - this gives us a new random variable g(X)g(X). Since the randomness in g(X)g(X) is entirely due to the randomness in XX, intuitively we should be able to compute E[g(X)]E[g(X)] using XX’s distribution p(x)p(x) without needing to compute g(X)g(X)’s distribution. It turns out that our intuition here is correct; this is a well-known result in probability theory that is used enough to have its own name

For a discrete random varible Xp(x)X \sim p(x) and function g ⁣:RRg\colon \mathbb{R} \to \mathbb{R} E[g(X)]=xXp(x)g(x) E[g(X)] = \sum_{x \in \mathcal{X}} p(x)\cdot g(x)

Despite the intuitive nature of this theorem, proving it requires some work which we omit here and leave as an exercise for the reader.2 Setting g(X)=logp(X)g(X) = -\log p(X) and using LOTUS we then have

H(X)=E[logp(X)]=E[I(X)] H(X) = E[-\log p(X)] = E[I(X)]

which allows us to formally justify our notion of entropy as the average amount of information contained in the outcome of a random varible.

We’ll go over one final example before concluding these notes. Suppose we want to communicate the outcome of an event represented by a random variable XX. Ideally (to save money, power, etc.) we’d like to do so with as short a message as possible on average. Naturallly, our encoding scheme should use fewer bits to represent high-probability events and more bits to encode low probability ones. It turns out (we’ll prove this later), any coding scheme for this problem requires at least I(X)I(X) bits on average to send our message without corrupting its contents. In other words, H(X)=E[I(X)]H(X) = E[I(X)] provides a lower bound on the average number of bits required to guarantee that our message will be transmitted successfully. While we won’t have the tools to prove this until later, a simple example will help illustrate this idea

Let X={awith probability 1/2bwith probability 1/4cwith probability 1/8dwith probability 1/8X = \begin{cases} a & \text{with probability 1/2} \\ b & \text{with probability 1/4} \\ c & \text{with probability 1/8} \\ d & \text{with probability 1/8} \end{cases}

How could we devise a code to transmit the outcome of XX in this case? One possible scheme is as follows: If X=aX = a, we send a 1 and end transmission, otherwise we send a 0 and continue. Similarly, if X=bX = b, we then send a 1 and stop, otherwise we send another 0. Finally, if X=cX = c we send a 1 and stop, otherwise we send a 0 and stop. Under this scheme, we send a total of 1 bit half the time, 2 bits 1/4 of the time, and 3 bits 1/4 of the time, giving us a total of 1.75 bits on average.

Figure 3. Visual representation of our probability space. At each step of our encoding algorithm, we eliminate half of the space that still remains.

Each time we ask our “questions” in this scheme, we’re eliminating half of the remaining probability space. This is very similar to the way that binary search eliminates half the possible solution space at each step of the algorithm. Moreover, just as binary search is an optimal algorithm (i.e., takes the fewest number of steps possible to still guarantee a correct answer), we would expect our encoding scheme to be optimal for XX.

We can verify this intuition by plugging into our formula for entropy

H(X)=12log1214log1418log1818log18=1.75 bitsH(X) = -\frac{1}{2}\log\frac{1}{2}-\frac{1}{4}\log\frac{1}{4}-\frac{1}{8}\log\frac{1}{8}-\frac{1}{8}\log\frac{1}{8} = 1.75 \text{ bits}

Since our average number of bits needed to encode the outcome of XX can’t be less than H(X)H(X), our proposed scheme is indeed optimal.


  1. One potential proof is to apply L’Hôpital’s rule rule from calculus 

  2. See here for additional details and an example proof