Saturday, May 18, 2013

Information Theory in Three Minutes

Claude Shannon, the father of information theory, used to play an interesting game at cocktail parties. He'd grab a book, open it to a random page, and cover up all but the first letter on the page, then ask someone to guess the next letter. If the person couldn't guess, he'd uncover the letter, then ask the person to guess the next letter. (Suppose the first two letters are 'th'. A reasonable guess for the next letter might be 'e'.) Shannon would continue in this manner, keeping score, until a good deal of text had been guessed. The further along one goes in this game, the easier it becomes (of course) to guess downstream letters, because the upstream letters provide valuable context.

What Shannon consistently found from experiments of this sort is that well over half of the letters in English text are redundant, because they can be guessed in advance. In fact, Shannon found that when all forms of redundancy are taken into account, English is more than 75% redundant, with the average information content of a letter being approximately one bit. (Yes, one bit. See Shannon's "Prediction and Entropy of Printed English.")

Shannon became intrigued by questions involving the efficiency of information transfer. What is the nature of redundancy in an information stream? Are some encodings more redundant than others? How can you quantify the redundancy? Eventually, Shannon elaborated a mathematical theory around the encoding and decoding of messages. That theory has since become extremely important for understanding questions of encryption, compression, detection of faint signals in the presence of noise, recovery of damaged signals, and so on.

A central concept in Shannon's theory is entropy. "Shannon entropy" is widely misunderstood and misinterpreted, so it's important to be clear on what it's not. It's not disorder: entropy in information theory is not the same as entropy in thermodynamics, even though the mathematics is similar. Shannon regarded entropy as a statistical parameter reflecting the amount of information (or resolved uncertainty) conveyed, on average, by a symbol. We think of the English alphabet as having 26 symbols. Since distinguishing among 26 equally likely values takes log2(26), or about 4.7 bits, the most a 26-letter alphabet could carry is 4.7 bits per symbol, but that ceiling is not the entropy. Shannon found that the entropy of English (the information actually conveyed per symbol) was closer to 1.0 than to 4.7. How can this be? The answer has to do with the fact that some symbols are used far more often than others, and also (as noted) some symbols are redundant by virtue of context.

Entropy gets at the actual (rather than ideal) information content of a message by taking into account the actual frequencies with which symbols are used. If English text used all 26 letters of the alphabet equally often (and unpredictably), the entropy of the text would be exactly log2(26), about 4.7 bits per symbol: each letter would contribute -(1/26) * log2(1/26), roughly 0.18 bits, to the total. But because letters are not used equally often, each letter instead contributes -p * log2(p), where p is its actual frequency, and whenever the frequencies are unequal that total adds up to less than 4.7.
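If you'd like to check the arithmetic, here's a minimal Python sketch; the entropy helper and the deliberately lopsided example distribution are mine, purely for illustration:

```python
import math

def entropy(probs):
    """Shannon entropy in bits per symbol: sum of -p * log2(p) over the alphabet."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

# 26 equally likely letters hit the ceiling: log2(26), about 4.7 bits per symbol.
print(f"{entropy([1/26] * 26):.3f}")        # 4.700

# A made-up, lopsided distribution over the same 26 letters falls well short of it.
skewed = [0.5] + [0.5 / 25] * 25
print(f"{entropy(skewed):.3f}")             # 3.322
```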

It's easy to visualize this with a simple example involving coin tossing. Suppose, for the sake of example, that a series of coin tosses constitutes a message. As a medium of communication, the coin toss can express only two states, heads or tails, which could be represented in binary as 1 and 0. If half of all tosses are heads and half are tails, then the entropy of the message is 0.5 * log2(0.5) for heads plus 0.5 * log2(0.5) for tails, or one bit per symbol. (Note: if you actually do the math you'll come up with negative 1. By convention, the sum is multiplied by -1 so that entropy can be expressed as a positive number.)

Consider now the situation of a two-headed coin. In this case, there is no "tails" term and the heads term is 1.0 * log2(1.0), or zero. This means the tossing of a two-headed coin resolves no uncertainty and carries no information.
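A couple of lines of Python make the sign convention and the no-information case concrete (the same illustrative entropy helper as above, repeated so the snippet stands alone):

```python
import math

def entropy(probs):
    """Shannon entropy in bits: each term -p * log2(p) is non-negative, and a
    zero-probability outcome is skipped (0 * log2(0) is taken to be 0)."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # fair coin: 1.0 bit per toss
print(entropy([1.0, 0.0]))   # two-headed coin: 0.0 bits -- no uncertainty resolved
```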

Continuing the example, consider the case of a weighted penny that falls heads-up two-thirds of the time. Intuitively, we know that this kind of coin toss can't possibly convey as much information as a "fair" coin toss. And indeed, if we calculate 2/3 * log2(2/3) for heads plus 1/3 * log2(1/3) for tails, we get an entropy value of 0.9183 bits per symbol, which means that each toss is (on average) 1.0 - 0.9183 = 0.0817, or 8.17%, redundant. If one were to take a large number of tosses of the weighted penny and convert them into symbols ('h' for heads and 't' for tails, say), the resulting data stream could, in principle, be compressed to 91.83% of its fully expanded size, but no further, because that's the entropy limit.
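Here's a short sketch of the weighted-penny case in Python; the simulated stream (its seed and length) is made up for illustration, but the frequency-based estimate lands close to the theoretical 0.9183:

```python
import math
import random
from collections import Counter

def entropy(probs):
    """Shannon entropy in bits per symbol."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

# Theoretical entropy and redundancy of a coin that lands heads two-thirds of the time.
h = entropy([2/3, 1/3])
print(f"theoretical: {h:.4f} bits per toss")   # 0.9183
print(f"redundancy:  {1 - h:.2%}")             # 8.17%

# Empirical check: estimate the same figure from the symbol frequencies of a simulated stream.
random.seed(1)
tosses = random.choices("ht", weights=[2, 1], k=100_000)
counts = Counter(tosses)
print(f"empirical:   {entropy([n / len(tosses) for n in counts.values()]):.4f} bits per toss")
```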

Actually, that last statement needs to be qualified. We're assuming, throughout this example, that the result of any given coin toss does not depend on the outcome of the preceding toss. If that rule is violated, then the true entropy of the "message" could be much lower than 0.9183 bits per symbol. For example, suppose the result of 12 successive coin-tosses was: h-h-t-h-h-t-h-h-t-h-h-t. There's a recurring pattern, and the pattern makes the stream predictable. Predictability reduces entropy; remember Shannon's cocktail-party experiment. (You might ask yourself what a message with all possible redundancy removed would look like, and in what way or ways, if any, it would differ from apparent randomness.)

Technically speaking, when symbols represent independent choices (not depending on what came before), the entropy can be calculated as before, and it's called the order-zero entropy. But if any given symbol depends on the value of the immediately preceding symbol, we have to distinguish between order-zero and order-one entropy. There are also order-two, order-three, and higher-order entropies, representing contexts of contexts.
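As a rough illustration of the idea, here's one way to estimate order-k entropies from a symbol stream; the conditional_entropy helper is my own sketch, applied to the repeating h-h-t pattern from the previous paragraph:

```python
import math
from collections import Counter, defaultdict

def conditional_entropy(s, k):
    """Average bits needed for the next symbol, given the k symbols preceding it
    (k = 0 reduces to the ordinary order-zero entropy)."""
    contexts = defaultdict(Counter)
    for i in range(k, len(s)):
        contexts[s[i - k:i]][s[i]] += 1
    total = len(s) - k
    h = 0.0
    for counts in contexts.values():
        n = sum(counts.values())
        h += (n / total) * sum(-(c / n) * math.log2(c / n) for c in counts.values())
    return h

stream = "hht" * 200   # the predictable h-h-t-h-h-t... message
for k in range(3):
    print(f"order-{k}: {conditional_entropy(stream, k):.3f} bits per symbol")
# order-0: ~0.918, order-1: ~0.67, order-2: 0.000 -- the pattern becomes
# fully predictable once two symbols of context are available.
```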

Suppose now I tell you that an organism's DNA contains only two types of base pairs: GC and AT. (You should be thinking "coin toss.") Suppose, further, I tell you that a particular organism's DNA is 70% GC. Disregarding higher-order entropy, does the DNA contain redundancy? If so, how much? Answer: 0.7 * log2(0.7) for GC plus 0.3 * log2(0.3) for AT gives 0.8813 bits per base pair, meaning the redundancy is about 12%. Could the actual redundancy be higher? Yes. It depends on what kinds of recurring patterns exist in the actual sequence of A, G, C, and T values. There might be recurring motifs of many kinds, and each would push the entropy lower.
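And the same little calculation for the 70% GC case, again treating the genome as a weighted coin (illustrative sketch only):

```python
import math

def entropy(probs):
    """Shannon entropy in bits per symbol."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

h = entropy([0.7, 0.3])                   # GC vs. AT
print(f"{h:.4f} bits per base pair")      # 0.8813
print(f"{1 - h:.1%} redundant")           # 11.9%
```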

Further Reading
Shannon's best-known paper, "A Mathematical Theory of Communication," Bell System Technical Journal, July and October 1948
"A Symbolical Analysis of Relay and Switching Circuits," Shannon's unpublished master's thesis
Claude Shannon's contribution to computer chess
Shannon-Fano coding
Nyquist-Shannon Sampling Theorem