This is a live blog of Lecture 1 (part 1 of 3) of my graduate machine learning class “Patterns, Predictions, and Actions.” A Table of Contents is here.
In his landmark 1948 paper “A Mathematical Theory of Communication,” Claude Shannon laid the foundations for modern digital communication. Most engineers know this work as the origin of channel capacity: as long as you code your signal properly, you can send a digital signal through media with tons of noise, interference, and loss, and the recipient can almost perfectly decode it on the other side. We have Shannon to thank for the ability to stream TikTok from anywhere on Earth.
Shannon’s paper is unfathomably deep. A 45-page theory paper at a modern machine learning conference might prove that some abstract quantity equals two rather than four. Shannon’s paper, by contrast, revolutionized electrical engineering and launched multiple research fields. Communications Theory. Information Theory. And, notably, Natural Language Processing.
Why language processing? The first part of Shannon’s paper focused on what is possible to transmit when the communication channel has no noise. This is a reasonable baseline to which noisy channels can be compared. He remarks that if a signal has some redundancy, we can encode the information into a smaller representation, send it through the channel, and decode it on the other side. Encoding and decoding would mean fewer bits sent through the channel and, hence, faster communication. If signals are predictable, we can transmit them more efficiently. This kind of coding is all about prediction.
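To see the compression angle concretely, here’s a toy illustration of my own (not anything from Shannon’s paper), using Python’s standard `zlib` compressor: redundant, predictable text squeezes down to a small fraction of its size, while unpredictable random bytes do not.

```python
import os
import zlib

# Highly redundant, predictable text...
predictable = b"the cat sat on the mat. " * 1000
# ...versus incompressible random bytes of the same length.
random_bytes = os.urandom(len(predictable))

print(len(predictable), len(zlib.compress(predictable)))    # 24000 -> on the order of a hundred bytes
print(len(random_bytes), len(zlib.compress(random_bytes)))  # 24000 -> still roughly 24000 bytes
```

The better you can predict the next symbol, the fewer bits you need to send.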
Shannon wanted to know how predictable English text was. Under a naive coding scheme, each character in the alphabet plus the space can be encoded with 5 binary bits. This is because there are 32 strings of ones and zeros with length five, but only 27 characters to represent. But can we do better?
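As a quick check on that arithmetic (a snippet I’m adding for the class, not anything in Shannon’s paper):

```python
import math

num_symbols = 27  # 26 letters plus the space

# Information per symbol if every symbol were equally likely:
print(math.log2(num_symbols))             # about 4.75 bits
# A fixed-length binary code must round up to a whole number of bits:
print(math.ceil(math.log2(num_symbols)))  # 5 bits per character
```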
To figure out the redundancy of text, Shannon built models to simulate text. His models used probability: he would choose the next character or sequence of characters by sampling from some probability distribution. What was a good distribution? Shannon chose to use Markov Processes: the next symbol’s probability would be a function of a previous window of text.
Shannon presented simulations of several different Markov approximations of English. The zeroth order approximation to English generates characters completely at random, each with probability 1/27:
But we know that letters don’t all occur with the same frequency. If we build a model that generates letters at random but with probability equal to their frequency in written English, we get text like this:
Shannon then showed that if you build a probability model where the next character’s probability is equal to how frequently it occurs after the previous two letters, you start to get text that looks like Latin:
Shannon got more sophisticated, switching from characters to words. Sampling words according to their frequency in common corpora yielded this passage:
But when he treated the words as depending on each other, something magical emerged. Here’s what happened when the probability of the next word was equal to the frequency with which it occurred after the previous word. The passage starts to look like English:
Shannon generated this text without a computer. Computers wouldn’t come to Bell Labs for another several years.
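Today the whole exercise is a few lines of code. Here is a minimal sketch, in Python, of a second-order (digram) character approximation in the spirit of Shannon’s experiment; the toy corpus and all function names are mine, and a serious reconstruction would need a much larger body of English text.

```python
import random
from collections import defaultdict, Counter

def sample_digram_text(corpus: str, length: int = 200, seed: int = 0) -> str:
    """Generate text where each character is drawn according to how often it
    follows the previous character in the corpus (a second-order letter
    approximation in Shannon's terminology)."""
    rng = random.Random(seed)
    # Count how often each character follows each character.
    follow_counts = defaultdict(Counter)
    for prev, nxt in zip(corpus, corpus[1:]):
        follow_counts[prev][nxt] += 1

    out = [rng.choice(corpus)]  # start from a random character of the corpus
    for _ in range(length - 1):
        counts = follow_counts[out[-1]]
        if not counts:  # dead end: restart from a random character
            out.append(rng.choice(corpus))
            continue
        chars, weights = zip(*counts.items())
        out.append(rng.choices(chars, weights=weights)[0])
    return "".join(out)

# A toy corpus for illustration; a real experiment needs far more English text.
print(sample_digram_text("the quick brown fox jumps over the lazy dog " * 50))
```

Swapping characters for words, or conditioning on a longer window, gives the other approximations Shannon describes.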
Shannon went on to claim that the way to measure the information content of text was to build a really good model that approximated text and then compute the conditional entropy of the next symbol given the past symbols. In modern machine learning lingo, this is the cross-entropy loss of the language model evaluated on all text (what the OpenAI guys spend billions of Microsoft dollars computing). Shannon cared about entropy because he would prove that the minimum number of bits per symbol you needed to send text down a noiseless channel was exactly this conditional entropy.
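For concreteness, here’s one way to compute such a number for a toy character model. This is my own illustrative sketch (the add-alpha smoothing and the corpus strings are my choices), not a procedure from Shannon’s papers.

```python
import math
from collections import defaultdict, Counter

def bits_per_character(train: str, test: str, alpha: float = 0.1) -> float:
    """Cross-entropy, in bits per character, of a digram character model
    fit on `train` and evaluated on `test`. Add-alpha smoothing keeps
    unseen character pairs from getting probability zero."""
    follow_counts = defaultdict(Counter)
    for prev, nxt in zip(train, train[1:]):
        follow_counts[prev][nxt] += 1
    vocab_size = len(set(train) | set(test))

    total_bits = 0.0
    for prev, nxt in zip(test, test[1:]):
        counts = follow_counts[prev]
        prob = (counts[nxt] + alpha) / (sum(counts.values()) + alpha * vocab_size)
        total_bits += -math.log2(prob)
    return total_bits / (len(test) - 1)

# A model that predicts the held-out text well needs fewer bits per character.
print(bits_per_character("the cat sat on the mat. " * 100, "the cat sat on the hat."))
```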
But how much information was in language? He tried to estimate this in his second paper, “Prediction and Entropy of Printed English.” Shannon computed the information content of his simple language models. For the naive language model with spaces, it was 4.8 bits per letter. Using letter frequencies, this number went down to 4.0 bits per letter. If digrams were used (predicting a letter from the one before it), the number was closer to 3.3. Using human guessers to predict the next letter from the preceding 100 characters, Shannon estimated that the entropy was around 1 bit per letter, somewhere between 0.6 and 1.3. 70 years later, using thousands of tokens to predict the next, this estimate seems about right. This recent paper says 0.7 bits per character.
Shannon proposed many things before 1950:
Language is predictable. Written English is approximately 85% redundant (roughly 0.7 bits of information per character against the 4.75 bits of a uniformly random choice from 27 symbols).
Language can be modeled by predicting the next character, word, or token based on the prior text. Statistical models can capture the same prediction rate as people.
The way to build these language models is to analyze many corpora of English to fill in the right statistics.
The way to evaluate the quality of the approximation is cross-entropy.
The Sparks of Artificial General Intelligence were flying at Bell Labs 75 years ago.
I joke, but combining Shannon’s language models with Rosenblatt’s perceptron (1957) and some function approximation gives us our modern language model. Or as the absurdly hyperbolic among us call it, artificial general intelligence. But all of the elements of these models were developed before 1965. So why did it take us so long to get to ChatGPT? What happened in the interim? That’s the question I’ll repeatedly ask through the first half of this course. Let me close with two partial answers to my question:
First, we needed to wait for computers to speed up. Shannon did his work on language (and on computer chess) before he had a computer to work with. Rosenblatt did his work on the Perceptron on an IBM 704, a computer less powerful than the controller in your microwave. In computer science, we take the transistor revolution for granted.
Second, we definitely did learn things as we refined our craft over 75 years. Almost every method we use in machine learning today was invented before 1965, but we gained many important insights on the way to webscale machine learning systems. What were those? We also, sadly but not surprisingly, developed many useless distractions in both theory and practice. I will spend less time on those, but at some point I want to write up a graveyard of ideas that have been (or should be) abandoned in machine learning. We learned a lot by trying alternatives that didn’t work.
Oddly, the other thing that always strikes me is the relentless optimism and over-promising that has followed machine learning since its inception. In a 1961 documentary “The Thinking Machine” produced by CBS to honor MIT’s 100th birthday, Shannon is quoted as saying:
“In discussing the problem of simulating the human brain on a computing machine, we must carefully distinguish between the accomplishments of the past and what we hope to do in the future. Certainly the accomplishments of the past have been most impressive. We have machines that will translate to some extent from one language to another. Machines that will prove mathematical theorems. Machines that will play chess or checkers sometimes even better than the men who designed them. These however are in the line of special-purpose computers aimed at particular specific problems.
“What we would like in the future is a more general computing system capable of learning by experience and forming inductive and deductive thoughts. This would probably consist of three main parts. In the first place, there would be sense organs, akin to the human eye or ear, whereby the machine can take cognizance of events in its environment. In the second place, there would be a large, general-purpose, flexible computer programmed to learn from experience, to form concepts, and capable of doing logic. In the third place, there will be output devices. Devices in the nature of the human hand capable of allowing a machine to make use of the thoughts that it has had, of the cognitive processes, in order to actually affect the environment. Work is going on on all of these fronts simultaneously and rapid progress is being made.
“I confidently expect that within 10 or 15 years we will find emerging from the laboratories something not too far from the robots of science fiction fame. In any case, whatever the result, this is certainly one of the most challenging and exciting areas of modern scientific work.”
Though he is more responsible than almost anyone else for our modern computers and communication networks, Shannon was not a perfect predictor. We’ve been 10 to 15 years away from Science Fiction since 1950. But look at the problems in that first paragraph, the ones he describes as mostly solved in 1961. They are more or less solved today. Computers can translate effectively for many purposes. Computers can automatically prove many theorems, or at least give us a lot of assistance. Computers can beat anyone at checkers or chess. My class will explore why those first things came to pass. And maybe then we can discuss why we still haven’t solved the problems in the second paragraph of Shannon’s quote.