Gleaning the Web

which threatens to swamp the Internet's promised productivity gains, educational benefits, and entertainment value. In recent years, computer science has risen to this challenge, with substantial progress on systems for retrieving and filtering text.

Information extraction (IE) systems provide a complementary service. IE is the task of identifying the specific fragments of a single document that constitute its core semantic content. IE from weather reports, for example, might involve identifying locations, dates, and high and low temperatures. IE from apartment listings would find neighborhoods, numbers of bedrooms, rents, and telephone numbers.

Scalability is the major challenge to IE. IE systems usually rely on extraction rules tailored to a particular document collection. If this knowledge is hand-crafted, porting an IE system to new collections will be expensive. Recent research has identified important classes of Internet IE tasks for which highly scalable systems have been developed. In this article, I

• describe these IE tasks and explain how machine learning yields highly scalable IE systems, and
• discuss remaining challenges and argue that scaling up AI applications on the Internet is an important challenge to machine learning.

IE is a key enabling technology for several Internet-taming strategies, such as the information integration systems that present a unified view of heterogeneous Internet sources [1,2]. A movie-information integrator, for example, might provide a single query interface to the movie review, cast list, and schedule information available from dozens of Internet sites. Users pose queries to the interface. The integrator decomposes each query into subqueries against the relevant sources, and then combines the results (see Figure 1). Information integration systems operate by interpreting the Internet sites not as free text, but as structured database-like knowledge sources.
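The query flow just described (decompose the query, ask each relevant source, combine the answers) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration rather than the design of any actual integration system: the `Source` class, the `integrate` function, the attribute names, and the in-memory tables that stand in for Internet sites are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Record = Dict[str, str]


@dataclass
class Source:
    """A hypothetical information source that can answer subqueries by title."""
    name: str
    provides: frozenset                      # attributes this source can supply
    lookup: Callable[[str], List[Record]]    # subquery: title -> matching records


def integrate(title: str, wanted: set, sources: List[Source]) -> Record:
    """Decompose a query into per-source subqueries and merge the results."""
    answer: Record = {"title": title}
    for source in sources:
        needed = wanted & source.provides
        if not needed:
            continue  # this source is irrelevant to the query, so skip it
        for record in source.lookup(title):
            for attr in needed:
                if attr in record:
                    # keep the first value found for each requested attribute
                    answer.setdefault(attr, record[attr])
    return answer


# Illustrative stand-ins for two sites; the data is made up.
REVIEWS = {"Example Movie": {"title": "Example Movie", "review": "three stars"}}
SHOWTIMES = {"Example Movie": {"title": "Example Movie", "showtime": "19:30"}}


def make_lookup(table: Dict[str, Record]) -> Callable[[str], List[Record]]:
    """Build a lookup function over an in-memory table."""
    return lambda title: [table[title]] if title in table else []


SOURCES = [
    Source("reviews", frozenset({"review"}), make_lookup(REVIEWS)),
    Source("showtimes", frozenset({"showtime"}), make_lookup(SHOWTIMES)),
]

result = integrate("Example Movie", {"review", "showtime"}, SOURCES)
print(result)
```

Note that the sketch takes structured records for granted, whereas real sites return pages written for human readers; the integrator has to treat those pages as database-like sources.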
To do so, the system processes site documents to extract the relevant text fragments (movie titles, actor names, and so on), while discarding extraneous material such as HTML tags or advertisements. The integration system uses a library of wrappers; each wrapper is an IE system customized for a particular Internet site (see Figure 2).

One might expect this Internet IE task to be inherently unscalable because

• source documents are designed for people, and few sites provide machine-readable specifications of their formatting conventions;
• ad hoc formatting conventions used at one site are rarely relevant elsewhere, so a new wrapper must be built for each additional site; and
• sites often change their formatting: a wrapper that …