A computer learns the hard way: By reading the Internet

At Carnegie Mellon University, a massive computer system called NELL (Never Ending Language Learner) is systematically reading the Internet, analyzing sentences for semantic categories and facts, teaching itself English, and educating itself in human affairs. We spoke to NELL's creators.

NELL reads the Web 24 hours a day, seven days a week, learning language like a human would — cumulatively, over a long period of time. It parses text on the Internet for ontological categories, like "plants," "music" and "sports teams," then uses contextual clues to sort out what things belong in which categories, like "Nirvana is a grunge band" and "Peyton Manning plays for the Indianapolis Colts." And, perhaps most Skynet-horror-inducing, "anger is an emotion."
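
To make that concrete, here's a minimal Python sketch of the kind of pattern-based extraction described above. This is not NELL's actual code: the patterns, the category labels like "musicArtist," and the function names are all invented for illustration. The key idea is that a candidate fact seen in many contexts across many pages is more trustworthy than one seen once.

```python
import re
from collections import defaultdict

# Hypothetical contextual patterns mapping text to candidate categories or
# relations. NELL learns patterns like these; here they are hand-written.
PATTERNS = [
    (re.compile(r"([\w ]+?) is a (\w+) band"), "musicArtist"),
    (re.compile(r"([\w ]+?) plays for the ([\w ]+)"), "athletePlaysForTeam"),
    (re.compile(r"([\w ]+?) is an emotion"), "emotion"),
]

def extract_candidates(sentences):
    """Tally each candidate fact; facts seen on many pages earn more trust."""
    counts = defaultdict(int)
    for sentence in sentences:
        for pattern, label in PATTERNS:
            for match in pattern.finditer(sentence):
                counts[(label,) + match.groups()] += 1
    return counts

corpus = [
    "Nirvana is a grunge band",
    "Peyton Manning plays for the Indianapolis Colts",
    "anger is an emotion",
]
for fact, n in extract_candidates(corpus).items():
    print(fact, "seen", n, "time(s)")
```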

In these classifications, NELL is correct about 87 percent of the time, and the more it learns, the more accurate it should become. Like the premise of a dystopian sci-fi story, Read the Web is both wonderful and terrifying, not unlike the idea of a "Semantic Web," an Internet as comprehensible to computers as it is to humans, which has circulated in computer science and AI discourse for years.

Upon discovering this project, I had tons of questions about NELL: could it read other languages? Who gets the data in the end? Does it have parental controls on? To find out, we talked to Professor Tom Mitchell, chair of the Machine Learning Department of the School of Computer Science at Carnegie Mellon University, and Burr Settles, a Carnegie Mellon postdoctoral fellow working on the project.

At the moment, NELL is learning language and semantic categories in English, which means its learning is limited to the output of the English-speaking world. Are there any plans to expand the program to other languages?

Professor Tom Mitchell: Interestingly, NELL's learning methods apply to other Western languages just as well as they do to English (as long as the language uses the same character set as English). We started with English because, well, we speak English. And also because that is the most-used language on the web, and we wanted NELL to have access to lots of text.

Burr Settles: In principle, the technology driving NELL is language-independent, so there is reason to believe that, given a corpus of Spanish or Chinese, it could learn just as well. In fact, I suspect there are some languages it would perform even better with; for example, syntax and orthography are generally more consistent in Spanish than in English, so a Spanish NELL might learn much more quickly and accurately.

Could an advanced NELL-like computer teach itself another language?

Burr Settles: Quite possibly. For example, imagine that NELL learns a lot about the French Revolution from English-language documents, and also knows (because we say so, or maybe because it read so!) that Wikipedia pages have corresponding translations in other languages. If NELL assumes the facts available on the English- and French-language Wikipedia pages for the French Revolution are roughly equivalent, then it could use its knowledge to start to infer patterns, rules, word morphologies, etc. in French, and then start reading other French-language documents.
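
To illustrate the idea with a hypothetical sketch (not how NELL actually works, and with invented data and names throughout): a fact learned from English pages, say that the French Revolution began in 1789, is assumed to also hold on the aligned French page, so a French sentence mentioning both the entity and the value can yield a candidate French extraction pattern.

```python
# Hypothetical data: a fact learned in English, the French name for the
# entity (say, via Wikipedia's inter-language links), and an aligned
# French sentence. None of this is NELL's real code or data.
known_fact = ("French Revolution", "beganIn", "1789")
entity_french_name = "Révolution française"
french_sentence = "La Révolution française a commencé en 1789."

def induce_pattern(sentence, entity, value):
    """If a sentence mentions both the entity and the known value, the text
    between them becomes a candidate extraction pattern for this relation."""
    if entity in sentence and value in sentence:
        start = sentence.index(entity) + len(entity)
        end = sentence.index(value)
        if start < end:
            return sentence[start:end].strip()
    return None

pattern = induce_pattern(french_sentence, entity_french_name, known_fact[2])
print("candidate French pattern for 'beganIn':", repr(pattern))
# -> candidate French pattern for 'beganIn': 'a commencé en'
```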

This isn't unlike the way humans can easily pick up certain words (concrete nouns, prepositions) when traveling in foreign-language countries. I know, because I just got back from two weeks in Spain, which is why I'm absent from that fabulous New York Times photo!

When will NELL stop running?

Professor Tom Mitchell: We have absolutely no intention of stopping it from running. NELL stands for "Never Ending Language Learner." We mean it, though of course we need to make research progress if we want to give it the ability to continue learning in useful ways.

Is NELL reading the web indiscriminately, or have you set it loose on particular corners of the Internet that are more conducive to language-learning (say, Wikipedia)?

Professor Tom Mitchell: NELL primarily uses a collection of 500,000,000 web pages that represent the most broadly popular, highly referenced pages on the web. But it also uses Google's search engine to search for additional pages when it is looking for targeted information (e.g., for pages that will teach it more about sports teams). So it's not in some corner of the web, but all over it.

Burr Settles: Currently, NELL reads indiscriminately. Of course, it tends to learn about proteins and cell lines mostly from biomedical documents, celebrities from news sites and gossip forums, and so on. In future versions of NELL, we hope it can decide its own learning agenda, e.g., "I've not read much about musical acts from the 1940s... maybe I'll focus on those kinds of documents today!" Alternatively, we could tell it that we need it to focus on a particular document. Previous successes in "machine reading" research have in fact relied on a narrow scope of knowledge (e.g., only articles about sports, or terrorism, or biomedical research) in order to learn anything. The fact that NELL learns to read reasonably well across all of these domains is actually a big step forward.
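
As a toy illustration of such a self-directed agenda (entirely hypothetical; the category names and counts below are invented), the system could simply pick whichever category it knows the least about and go read about that:

```python
# Invented knowledge-base coverage statistics for three categories.
facts_per_category = {
    "sportsTeam": 12840,
    "protein": 9310,
    "musicalActsOfThe1940s": 112,   # barely covered so far
}

def next_focus(category_counts):
    """Pick the least-covered category as today's reading focus."""
    return min(category_counts, key=category_counts.get)

focus = next_focus(facts_per_category)
print(f"I've not read much about {focus}... focusing on those documents today!")
```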

It has been interesting to hear the public's response to NELL. There are many jokes about what will happen when it comes across 4chan or LOLcats, for example. But the reality is, those texts are already available to NELL, and it is largely ignoring them because they are so ill-formed and inconsistent.

Say NELL learns the English language well enough to be a Shakespearean scholar. What happens to the data then — do Google and Yahoo and DARPA get access to it?

Professor Tom Mitchell: Yes, and so will everybody. Already we have put NELL's growing knowledge base up on the web. You can browse it, and also download the whole thing if you like. Furthermore, I am committed to sticking to this policy of making NELL's extracted knowledge base available for free to anybody who wants to use it for any commercial or non-commercial purpose, for the life of this research project.

Lastly, the name NELL is a joke about the Jodie Foster movie, right?

Professor Tom Mitchell: Well, no. I didn't really know about that movie... but I just took a look at NELL's knowledge base, and it appears to know about it. Take a look. There, the light grey items are low-confidence hypotheses that NELL is considering but not yet committing to. The dark black items are higher-confidence beliefs. So it is considering that NELL might be a movie, a disease, and/or a writer, but it's pretty confident that Jodie Foster starred in the movie.
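
The promotion behavior Mitchell describes can be pictured with a tiny sketch (the confidence scores and threshold below are invented, not NELL's real numbers): candidate beliefs carry scores, and only those above some threshold get committed to the knowledge base.

```python
# Invented confidence scores illustrating promotion: beliefs above a
# threshold are committed; the rest stay as hypotheses under consideration.
PROMOTION_THRESHOLD = 0.9

candidate_beliefs = {
    ("Nell", "isA", "movie"): 0.55,
    ("Nell", "isA", "disease"): 0.40,
    ("Nell", "isA", "writer"): 0.35,
    ("Jodie Foster", "starredIn", "Nell"): 0.96,
}

def promote(beliefs, threshold=PROMOTION_THRESHOLD):
    """Split candidates into committed beliefs and pending hypotheses."""
    committed = {b: s for b, s in beliefs.items() if s >= threshold}
    pending = {b: s for b, s in beliefs.items() if s < threshold}
    return committed, pending

committed, pending = promote(candidate_beliefs)
print("committed:", committed)
print("still considering:", pending)
```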

Top image: Jeff Swensen for The New York Times.

A longer version of this article by Claire Evans originally appeared over at Universe.