Claude Shannon, SM ‘40, PhD ‘40, threw down an irresistible challenge to those who would be pioneers of information theory. A young grad student soon met the challenge, but his solution languished in obscurity for decades.
In 1948, the world was still an analog place. Candid Camera and Ed Sullivan were just beginning their long runs on TV; Jack Benny’s radio show had tens of millions of listeners. But bad reception was a fact of life. Electromagnetic interference, physical obstacles between a transmission tower and a receiver, and other sources of what engineers call “noise” routinely disrupted Benny’s monologues or the performances of Sullivan’s guests. In most areas, for at least some stations, people resigned themselves to snowy images or static-plagued audio.
That same year, however, Claude Shannon, SM ‘40, PhD ‘40, published a landmark paper in which he mathematically proved that even in the presence of a lot of noise, it was possible to transmit information with virtually no errors. It was an analog world, but Shannon’s stunning conclusion was the result of his ability to think digitally. Information in any medium, Shannon argued, could be represented using binary digits, or “bits”–a word that his paper introduced to the world. While noise in a communication channel can corrupt the bits, he explained, adding extra bits that are related to the original bits by some known algorithm–an error-correcting code–will make it possible to deduce the original sequence.
The noisier the channel, the more extra information must be added to make error correction possible. And the more extra information is included, the slower the transmission will be. Shannon showed how to calculate the smallest number of extra bits that could guarantee minimal error–and, thus, the highest rate at which error-free data transmission is possible. But he couldn’t say what a practical coding scheme might look like.
Researchers spent 45 years searching for one. Finally, in 1993, a pair of French engineers announced a set of codes–“turbo codes”–that achieved data rates close to Shannon’s theoretical limit. The initial reaction was incredulity, but subsequent investigation validated the researchers’ claims. It also turned up an even more startling fact: codes every bit as good as turbo codes, which even relied on the same type of mathematical trick, had been introduced more than 30 years earlier, in the MIT doctoral dissertation of Robert Gallager, SM ‘57, ScD ‘60. After decades of neglect, Gallager’s codes have finally found practical application. They are used in the transmission of satellite TV and wireless data, and chips dedicated to decoding them can be found in commercial cell phones.
The Birth of Information Theory
Gallager came to MIT in 1956–the same year Shannon himself returned as a professor, after 15 years at Bell Labs. But it wasn’t the prospect of working with Shannon that led him to choose MIT over Yale, where he had also applied to graduate school. “I was in the army–on a meaningless assignment–and I really hated what I was doing,” says Gallager, who taught at MIT for more than 40 years after earning his doctorate and still advises graduate students as a professor emeritus in the Research Lab of Electronics. “MIT started one week earlier than Yale did. And I was so anxious to get out of the army that that was really my only reason for coming to MIT.”
Gallager wasn’t even sure that he wanted to study information theory, the burgeoning new discipline born of Shannon’s 1948 paper. But before joining the Army Signal Corps, Gallager, too, had worked for several years at Bell Labs, where he spent three days a week in a classroom learning about the latest advances in electrical engineering. Though he’d never met Shannon, that experience helped him recognize the scope of his accomplishment. “I just viewed him as sort of a god,” Gallager says.
Indeed, by the time Shannon joined the MIT faculty, he was a minor celebrity. As early as 1953, an article on information theory in Fortune magazine had declaimed, “It may be no exaggeration to say that man’s progress in peace, and security in war, depend more on fruitful applications of information theory than on physical demonstrations, either in bombs or in power plants, that Einstein’s famous equation works.”
What captured the public imagination was the idea that information in all its diversity–text, audio, video–could be boiled down to mere sequences of 1s and 0s. Commercial digital devices did not yet exist, so it blew people’s minds that 001001010101000101011101 could represent part of a symphony, or part of a movie, or a color, or a line from a book. But as Shannon pointed out in his paper, his Bell Labs colleague Ralph Hartley had made a similar suggestion 20 years earlier. The aspect of the paper that captivated–and continues to captivate–Shannon’s fellow engineers was the ingenious way he proved that there must be some code capable of producing error-free data transmissions right up to a channel’s capacity.
To understand how an error-correcting code works, consider someone who’s trying to send a four-bit message across a noisy channel. If the noise causes one of the bits to flip to its opposite, the receiver has no way of knowing that an error has occurred. Simply repeating the message, so that 0011 becomes 00110011, solves that problem: now, if one bit flips to its opposite, the receiver knows there’s an error, because the two versions of the message don’t match. But it’s impossible to tell which one is correct. A better way to encode the message might use the four extra bits to represent information about the message bits: the fifth bit, for instance, could tell you whether the first two bits of the message have the same or different values; the sixth bit could do the same for bits three and four, the seventh for bits one and three, and the eighth for bits two and four. If one of the first four bits gets flipped, the last four can identify it; if one of the last four bits gets flipped, the other three may convey enough information to make up for it.
Shannon’s paper, however, eschews any such ruminations about how to actually construct codes. Instead, it approaches the concept of error correction by statistically analyzing the general properties of codes selected entirely at random. To get a sense of his approach, it might help to see how it could be applied to our hypothetical eight-bit sequences that encode four-bit messages.
There are 16 possible four-bit messages, and Shannon’s method would assign each of them its own randomly selected eight-bit serial number–its “code word.” The receiver, like the sender, would have a codebook correlating the 16 possible four-bit messages with the 16 random eight-bit code words. Since there are 256 possible sequences of eight bits, there are 240 that don’t appear in the codebook. Someone who receives one of those 240 sequences will know that an error has crept into the data. But as long as the 16 permitted code words are different enough from each other, there’s likely to be only one that comes closest to the corrupted sequence. For instance, if 00000001 and 11111110 are both valid code words but 00000011 is not, then someone who receives the sequence 00000011 can conclude that the intended code word was much more likely to be 00000001 than 11111110.
In real life, of course, no one is worried about transmitting messages of only four bits. But by using statistical analysis, Shannon was able to draw conclusions about encoded messages of any length, sent over channels with any amount of noise. In particular, he was able to rigorously quantify both the degree of difference between randomly selected code words and the likelihood that a corrupted sequence would resemble only one of them. While the probability that two eight-bit sequences will be similar is relatively high, Shannon showed that as code words get longer, the chances of similarity decrease exponentially. In fact, one of his most startling results was that for long messages, most randomly assigned code words will be almost as different from one another as it’s possible for them to be. That means that almost any coding scheme–any way of generating those words–would allow error-free transmission across a noisy channel at near the maximum rate.
“It took a lot of intuition to think that a perfectly random code might be a pretty good code on average,” says David Forney, SM ‘63, ScD ‘65, a former vice president of the Codex Corporation and Motorola who returned to MIT in 1996 as an adjunct professor. “It turns out that that drastically simplifies the analysis, because now you can do an average-case analysis.” Forney pauses for a moment, then adds, “Not to say that it was totally simple: he had to invent a few theorems at least, if not branches of mathematics.” But Gallager agrees. Of Shannon’s 1948 paper, he says, “After you study it for two years, it seems very simple. So many people will tell you, ‘It’s really very simple.’ And after you understand it, it is.”
An Irresistible Challenge
Shannon’s mathematical description of information had many ramifications. His 1948 paper also introduced the idea of data compression, or representing the same information with fewer bits; compression is what lets programs like WinZip or StuffIt shrink files down so that they don’t overwhelm e-mail servers, and it’s used to save space in disk drives. Information theory also put the study of cryptography on a more secure mathematical footing; indeed, Gallager believes that it was Shannon’s wartime cryptographic work at Bell Labs that led him to his novel reconception of communication.
By the time Shannon returned to MIT, however, he had begun to feel that the enthusiasm surrounding his theory exceeded even its considerable merits. In a 1956 article called “The Bandwagon,” he cited attempts to apply information theory to fields such as “biology, psychology, linguistics, fundamental physics, economics, the theory of organization, and many others” and undertook to “inject a note of moderation in this situation.”
Shannon’s distaste for the limelight bordered on reclusiveness. According to Joel West ‘79, a professor at San José State University’s College of Business who’s writing a book about the development of information theory, Shannon advised only seven graduate students during his 22 years at MIT. “He was quite shy and retiring, so if you wanted to get him as a supervisor, you really had to be quite aggressive about it,” says Gallager. “I was shy and retiring also, and didn’t have enough self-confidence to even go in and talk to the guy.”
As a teacher, Shannon had little patience for the tedium of the familiar. “He was much more interested in the new than in the old,” says Elwyn Berlekamp ‘62, SM ‘62, PhD ‘64, a professor emeritus of mathematics at the University of California, Berkeley, who (with Gallager) was a coauthor on Shannon’s final published paper.
“He did not teach a whole lot,” says Gallager. “But when he taught, it was like giving research talks. I remember once he gave a course, which was about 25 lectures during the term, and every lecture was a new research result. He would do them one after another and never failed to come up with something interesting. It was a really fantastic period.”
“Shannon was, in my opinion, a little bit out of place in academia,” says James L. Massey, SM ‘60, PhD ‘62, an information theorist and professor emeritus at ETH Zurich. “His real genre was to be an independent researcher and do things in his own highly individualistic way.”
It may be, too, that Shannon was simply uncomfortable with adulation. Berlekamp recalls when the IEEE Information Theory Society invited Shannon to deliver a lecture and receive its inaugural Shannon Award in Israel in 1973. “I’ve never seen anybody with more butterflies than him,” he says. “Five minutes before the talk’s to start, he’s at the bar, and he’s pretty depressed. He’s really afraid of going on stage and disappointing everybody. Because of course they expect God, which is true, and he knows he can’t perform like God.”
But if Shannon was rarely a direct mentor to young students of information theory, he had set them an irresistible challenge. Random coding would never work in practice: the size of Shannon’s hypothetical codebook doubled with each additional bit in the message. The codebook for a single 1,000-bit packet of data traveling over the Internet would require more entries than there are atoms in the universe. But any more practical coding mechanism–like repeating the original message, or adding extra bits that described message bits–was the equivalent of some random coding scheme, in that it would generate the same code words. And by demonstrating that the vast majority of random coding schemes were capacity-approaching, Shannon offered hope that one of the practical ones was as well.
Instead of using a codebook to match code words and messages, a practical coding scheme would provide a way to extract the message from the code words computationally. A series of mathematical operations could, with a high probability of accuracy, identify and correct errors in a possibly corrupted bit sequence received over a noisy channel.
It’s one of the peculiarities of error-correcting codes that a good encoding algorithm doesn’t necessarily imply a good decoding algorithm. Using statistical analyses similar to Shannon’s, coding theorists were able to show that a given code was capacity-approaching–that it would maximize the difference between code words. But that didn’t mean they had an efficient way to decode it.
Between the publication of Shannon’s paper and the early 1990s, researchers proposed better and better codes and also better and better decoding algorithms. But a practical capacity-approaching code remained elusive. “There used to be a saying among coding theorists,” says Forney, “that almost any code is a good one–except for all the ones we can think of.”
The codes that Gallager presented in his 1960 doctoral thesis were an attempt to preserve some of the randomness of Shannon’s hypothetical system without sacrificing decoding efficiency. Like many earlier codes, Gallager’s used so-called parity bits, which indicate whether some other group of bits have even or odd sums. But earlier codes generated the parity bits in a systematic fashion: the first parity bit might indicate whether the sum of message bits one through three was even; the next parity bit might do the same for message bits two through four, the third for bits three through five, and so on. In Gallager’s codes, by contrast, the correlation between parity bits and message bits was random: the first parity bit might describe, say, the sum of message bits 4, 27, and 83; the next might do the same for message bits 19, 42, and 65.
Gallager was able to demonstrate mathematically that for long messages, his “pseudo-random” codes were capacity-approaching. “Except that we knew other things that were capacity-approaching also,” he says. “It was never a question of which codes were good. It was always a question of what kinds of decoding algorithms you could devise.”
That was where Gallager made his breakthrough. His codes used iterative decoding, meaning that the decoder would pass through the data several times, making increasingly refined guesses about the identity of each bit. If, for example, the parity bits described triplets of bits, then reliable information about any two bits might convey information about a third. Gallager’s iterative-decoding algorithm is the one most commonly used today, not only to decode his own codes but, frequently, to decode turbo codes as well. It has also found application in the type of statistical reasoning used in many artificial-intelligence systems.
“Iterative techniques involve making a first guess of what a received bit might be and giving it a weight according to how reliable it is,” says Forney. “Then maybe you get more information about it because it’s involved in parity checks with other bits, and so that gives you an improved estimate of its reliability.” Ultimately, Forney says, the guesses should converge toward a consistent interpretation of all the bits in the message.
Although Gallager hadn’t been able to muster the courage to ask Shannon to be his advisor, he says that he did talk to Shannon “three or four times” while writing his thesis. “Except that talking to Claude three or four times was like talking to most people 50 times,” he says. “He was somebody who really caught on to the ideas very, very fast. He was not great at all the technical details. But to see the structure of something, to see why it ought to work, and to see what might make it better–well, he was certainly the smartest person I’ve ever met.”
Still, Shannon did not foresee the success of Gallager’s codes. “My recollection is he thought they were interesting, but I didn’t have the sense that he was excited by them,” Gallager says. He understands why. Gallager’s codes approached channel capacity as they got longer; but as they got longer, the decoding process also grew more complex–far too complex for the computers of the time. Coding researchers knew, of course, that computers would improve. But no one knew whose codes those improvements would favor.
Nonetheless, MIT immediately hired Gallager as a faculty member on the strength of his thesis. In the ensuing years, while his own coding scheme languished in obscurity, he taught and mentored a wave of brilliant students–including Massey, Forney, and Berlekamp–whose contributions to coding theory had more immediate practical implications than his own.
Gallager, however, seems as unruffled by the long neglect of his codes as he does by their recent revival–perhaps because he always took the long view. “He has a knack of inventing things that lie dormant for tens of years until people suddenly realize it’s pretty good stuff,” says Vincent Chan ‘71, MS ‘71, EE ‘72, PhD ‘74, a professor of electrical engineering who still displays by his desk the doorplate from the office he once shared with Shannon. Chan recalls a recent visit to the labs of a major software company, where a researcher was boasting of a new compression technique that would enable video files to take up only a hundredth as much memory as they do now. Chan felt obliged to point out that Gallager had introduced the technique in 1974. “Many of these ideas take quite a bit of time to think through,” he says, “and at the time that you’re thinking them through, there are many, many options. And you really have to think very carefully and maybe over a long period of time before you figure out which one is the right one. Bob does that a lot.”
Muriel Médard ‘89, ‘90, MS ‘91, ScD ‘95, an information theorist in the Research Lab of Electronics, agrees. “Bob wasn’t running around trying to publish and make sure he wasn’t getting scooped,” she says. For instance, Médard recalls a conversation between Gallager and a prominent younger information theorist, who in describing his own work cited a recently proved theorem that it relied upon. “Bob starts rummaging through stuff, the way he does,” Médard says. Eventually he produced a tattered copy of one of his own papers. “He had this teensy-weensy little proof,” Médard says. “And it was like a footnote. A thick footnote, but a footnote. ‘They named that?’ ‘Yeah, Bob, it’s a major theorem now.’ ”
Today, Gallager’s codes underlie the approaches that come closest to the maximum data rate for a given communication channel–closer even than turbo codes. In addition to their applications in telecommunications, they’re beginning to replace the older codes used to protect data in disk drives and other storage devices.
To people like Forney, who were at MIT during what he calls the “golden age” of coding theory, the fact that the challenge issued by Shannon’s 1948 paper has been met is somewhat bittersweet. “Those of us who know and love coding are reluctant to say that the problem has been completely solved,” says Forney. “But it’s true that most people have moved on to other things.”
“From 1950 to 1965, MIT was the hotbed of information theory,” says Joel West. “It really was a golden age.”
AI and robotics are changing the future of work. Learn from the humans leading the way at EmTech Next 2019.Register now