AI watch: data mining and the Web

If the data warehouse provides the ideal structure for data mining and knowledge discovery [1], then the World-Wide Web, with its lack of structure, provides the greatest technical challenge for those who would use AI and statistical methods to glean knowledge from data. In a pair of articles [2, 3], Oren Etzioni discusses these challenges as well as some of the techniques used to overcome them in several deployed Web-based systems.

In [2], Etzioni convincingly argues that the lack of structure that characterizes the Web is only apparent. Large portions of the Web are multi-layered sites with data warehouse-like structure (on-line catalogues of merchandise); other portions have very characteristic features (home pages); still other portions are partially labeled by HTML annotations like <title>, as well as by the linguistic and typographic conventions of files in natural language, PostScript, LaTeX, and the like; and all Web servers have a domain name, which serves to partially limit what they might contain. These last two points matter most for systems that learn about the Web by coming to understand common tags and the content associated with certain domain names. In short, Etzioni believes, as this writer certainly does, that the Web is more a "gold mine" than a "quagmire."

The systems Etzioni and his colleagues have deployed fall into three broad classes enumerated in [2]: (a) Resource discovery: "[l]ocating unfamiliar documents and services on the Web"--here the focus is on search; (b) Information extraction: "[a]utomatically extracting specific information from ... Web resources"--here the focus is on understanding; and (c) Generalization: "[u]ncovering general patterns at individual Web sites and across multiple sites"--here the focus is on learning. All three areas--search, understanding, and learning--are classical AI tasks.

As for the last of these, learning, the Web-based systems discussed differ from those that merely interact with the user to learn his preferences and then search or act on his behalf; these systems all learn about the Web itself, and do so by various methods, including interaction and experience. In that sense they are more genuinely intelligent.

Joseph S. Fulda, CSE, Ph.D.
701 West 177th Street, #21
New York, NY 10033
fulda@acm.org

Copyright © 1997 Joseph S. Fulda