Information Theory Simplified(?)
by way of explanation

Shannon's Information Theory can be a useful tool for measuring the amount of order in a system. However, the name is somewhat misleading... Information in this context has nothing to do with Meaning, and this trips up many newcomers. It has also been suggested (I believe I owe this coinage to Murray Gell-Mann, but my belief is not strong enough to generate quotation marks) that it should have been called Confusion Theory, partially because what is being measured is the amount of Randomness in a system. A system that has one-and-only-one state for all time has an Information value of 0, whereas the seemingly random universe has an estimated value of 10^91 bits (http://www.mdpi.com/1099-4300/10/3/150/pdf).

Adding to the confusion, the Information Measure is called Entropy -- we have Johnny von Neumann to thank for this, as he was looking over Claude Shannon's shoulder when names were being assigned and noticed that the soon-to-become-familiar equation looked just like the one for Thermodynamic Entropy. This also trips up many newcomers, but in-probable-fact it is an apt analogy (see the Maxwell's Demon discussion on the wiki page: http://en.wikipedia.org/wiki/Entropy_%28information_theory%29). There are also many <IMHO>misguided</IMHO> attempts to use entropy as a measure of complexity, which is often an orthogonal quantity to randomness (the Crutchfield & Young 1989 paper chases some of that down if you are really interested).

So what is Shannon Information? From the wiki page:

If p denotes the probability mass function of X then the entropy can explicitly be written as
H(X) = \sum_{i=1}^n {p(x_i)\,I(x_i)} = -\sum_{i=1}^n {p(x_i) \log_b p(x_i)},

where b is the base of the logarithm used. Common values of b are 2, Euler's number e, and 10, and the unit of entropy is bit for b = 2, nat for b = e, and dit (or digit) for b = 10.[3]

Here I(x_i) = -log_b p(x_i) is the self-information, or "surprise", of the individual outcome x_i. The scale of the entropy value is selected by the base of the log function used, and the most common scale is log2, giving a measure in bits. Fortunately, it is easy to convert between scales because:

log2 N = log10 N / log10 2

So you can do all your calculations with "regular" log functions and then divide the final result by (log10 2).
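
For instance, a quick Python sketch of that conversion (my own illustration, not part of the original write-up; the same result comes from math.log2 directly):

    import math

    N = 6                                 # any positive number
    bits = math.log10(N) / math.log10(2)  # same result as math.log2(N)
    print(bits)                           # ~2.585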

What the equation says is: Given a probability distribution X, i.e., a set of probabilities for things that might happen where the full set adds up to 1.0, such as heads or tails in a coin flip (2 equal probabilities of 0.5 each) or the numbers rolled on a six-sided die (6 equal values of 0.1666... each), the Shannon Entropy, H(X), is the negative of the sum of each probability in the set times the logarithm of that probability. The amazing thing is that this works just as well for things that are not equally probable. What the entropy measures is how random the distribution is. If the coin or die is weighted, or unfair, the measured entropy will be less than that predicted for the equal distribution. The equal distribution thus puts a fundamental upper limit on how much Information you can have in a particular system.
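
To make that concrete, here is a minimal Python sketch (my own, not from the quoted definition; entropy_bits is just a name I picked) of H(X) for the distributions just mentioned, plus a weighted coin:

    import math

    def entropy_bits(probs):
        # Shannon entropy H(X) in bits: negative sum of p * log2(p), skipping p = 0 terms.
        return -sum(p * math.log2(p) for p in probs if p > 0)

    print(entropy_bits([0.5, 0.5]))   # fair coin           -> 1.0 bit
    print(entropy_bits([1/6] * 6))    # fair six-sided die  -> ~2.585 bits
    print(entropy_bits([0.8, 0.2]))   # weighted 80/20 coin -> ~0.722 bits, less than the fair coin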

In many situations it is helpful to think of the Shannon Information Entropy as a measure of how surprised you are to see a new piece of data. For instance, our old friend the coin flip: if the coin is fair, every flip gives you some new and unexpected (not predictable) result. You get one more bit of information, and a sequence of 8 coin flips gives you 8 bits. But if the coin had two heads, you would not be surprised to get a head after every flip, and you would have 0 bits of new information. Since many interesting things are boolean in nature, bits of Information are quite useful. Getting entropy values in bits also makes it possible to estimate the amount of computer storage one might need: a byte of storage can represent 256 equally likely values and contains 8 bits.
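
The "surprise" of a single outcome is just the negative log2 of its probability. A tiny sketch of my own for the coin examples above (surprise_bits is a name I made up for illustration):

    import math

    def surprise_bits(p):
        # Self-information of a single outcome with probability p, in bits.
        return -math.log2(p)

    print(surprise_bits(0.5))       # one fair coin flip: 1 bit of surprise
    print(8 * surprise_bits(0.5))   # 8 fair flips:       8 bits
    print(surprise_bits(1.0))       # two-headed coin:    0 bits, no surprise at all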


To illustrate, I made a little spreadsheet that calculates the entropy in dits and bits for some simple probability distributions. The basic calculation process is as follows (a small code sketch after the list walks through the same steps):
  1. Get a list of events that can occur, with their associated probabilities.
    For a coin this would be heads or tails, each with a probability of 0.5;
  2. Multiply each probability, P, by its logarithm. You can use any base you like, but for instance:
    N = P * log10 P
    Note that for P=0 the log is undefined, but the limit of P * log P as P approaches 0 is 0, so it is logically ok to use N = 0. For the coin with P = 0.5, you get: N = -0.150515;
  3. Sum up all the N's that you calculated and negate the result. This will give you the entropy value in decimal dits.
    Note that you need to negate it because logarithms of numbers between 0 and 1 are negative, but by convention we want to end up with a positive result. Again for the coin, you have two equal probabilities -- one for heads and one for tails -- so your -Sum = 0.30103;
  4. To get entropy in bits, divide your dits by log10 2.
    This will give you: Bits = 0.30103 / 0.301029996 = 1... Not surprisingly, a coin flip gives you one bit of information because the result was either heads or tails!
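
Here is how those four steps might look in Python -- a sketch of my own that mirrors the arithmetic for the fair coin, not the spreadsheet itself:

    import math

    # Step 1: the events that can occur and their probabilities (a fair coin).
    probs = [0.5, 0.5]                    # heads, tails

    # Step 2: multiply each probability by its logarithm (base 10 here).
    terms = [p * math.log10(p) if p > 0 else 0.0 for p in probs]
    print(terms)                          # roughly [-0.150515, -0.150515]

    # Step 3: sum the terms and negate to get the entropy in dits.
    dits = -sum(terms)
    print(dits)                           # 0.30103

    # Step 4: divide by log10(2) to convert dits to bits.
    bits = dits / math.log10(2)
    print(bits)                           # 1.0 -- one bit per fair coin flip
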
On the spreadsheet I calculate the Entropy of the following:
  1. A fair coin, as we saw above: H(X) = 1 bit;
  2. An unfair 80%/20% coin: H(X) = .7219 bits;
  3. An 8-bit number, in a fairly complicated way, but strangely enough: H(X) = 8 bits;
  4. A six-sided fair die: H(X) = 2.584 bits;
  5. The entropy of several data distributions. Some are made up and some are generated by Excel distribution functions and then histogrammed into 13 bins. They suffer from having only around 100 samples per distribution and the coarse-grained binning, but they serve to illustrate that as things get more evenly distributed -- i.e., more random -- their entropy value goes up:
    1. A single bin with probability = 1.0: H(X) = 0 bits;
    2. A wider step function having 3 bins with P= 0.33 each: H(X) = 1.584 bits;
    3. A "normal" Gaussian distribution with stddev = 1: H(X) = 2.195 bits;
    4. Another "normal" Gaussian distribution with stddev = 3: H(X) = 3.271 bits;
    5. A generated "uniform" distribution, which is not very uniform: H(X) = 3.505 bits;
    6. An (artificially) flat distribution made by hand: H(X) = 3.789 bits;
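
If you want to check the closed-form cases without the spreadsheet, a short Python sketch of my own reproduces them (the sampled distributions above depend on the particular Excel-generated data, so they are not reproduced here):

    import math

    def entropy_bits(probs):
        # Shannon entropy in bits, skipping zero-probability bins.
        return -sum(p * math.log2(p) for p in probs if p > 0)

    print(entropy_bits([0.5, 0.5]))       # 1. fair coin             -> 1.0
    print(entropy_bits([0.8, 0.2]))       # 2. unfair 80%/20% coin   -> ~0.7219
    print(entropy_bits([1/256] * 256))    # 3. uniform 8-bit number  -> 8.0
    print(entropy_bits([1/6] * 6))        # 4. fair six-sided die    -> ~2.585
    print(entropy_bits([1.0]))            # 5.1 single bin           -> 0 bits (prints as -0.0)
    print(entropy_bits([1/3] * 3))        # 5.2 three equal bins     -> ~1.585 (~1.584 with the spreadsheet's P = 0.33)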