Paper information - The World-Wide Web: quagmire or gold mine?

Skeptics believe the Web is too unstructured for Web mining to succeed. Indeed, data mining has been applied traditionally to databases, yet much of the information on the Web lies buried in documents designed for human consumption such as home pages or product catalogs. Furthermore, much of the information on the Web is presented in natural-language text with no machine-readable semantics; HTML annotations structure the display of Web pages, but provide little insight into their content. Some have advocated transforming the Web into a massive layered database to facilitate data mining [12], but the Web is too dynamic and chaotic to be tamed in this manner. Others have attempted to hand code site-specific “wrappers” that facilitate the extraction of information from individual Web resources (e.g., [8]). Hand coding is convenient but cannot keep up with the explosive growth of the Web. As an alternative, this article argues for the structured Web hypothesis: Information on the Web is sufficiently structured to facilitate effective Web mining. Examples of Web structure include linguistic and typographic conventions, HTML annotations (e.g., <title>), classes of semi-structured documents (e.g., product catalogs), Web indices and directories, and much more. To support the structured Web hypothesis, this article will survey preliminary Web mining successes and suggest directions for future work. Web mining may be organized into the following subtasks:

Oren Etzioni

[1] David R. Karger, et al. Scatter/Gather: a cluster-based approach to browsing large document collections, 1992, SIGIR '92.

[2] Oren Etzioni, et al. Category Translation: Learning to Understand Information on the Internet, 1995, IJCAI.

[3] Oren Etzioni, et al. A softbot-based interface to the Internet, 1994, CACM.

[4] Jiawei Han, et al. Resource and Knowledge Discovery in Global Information Systems: A Preliminary Design and Experiment, 1995, KDD.

[5] William A. Gale, et al. A sequential algorithm for training text classifiers, 1994, SIGIR '94.

[6] Oren Etzioni, et al. Moving Up the Information Food Chain: Deploying Softbots on the World Wide Web, 1996, AI Mag.

[7] Kristian J. Hammond, et al. A Case-Based Approach to Knowledge Navigation, 1994, IJCAI.

[8] Oren Etzioni, et al. A scalable comparison-shopping agent for the World-Wide Web, 1997, AGENTS '97.

[9] Steven D. Whitehead, et al. Auto-FAQ: An Experiment in Cyberspace Leveraging, 1995, Comput. Networks ISDN Syst.

[10] Peter B. Danzig, et al. The Harvest Information Discovery and Access System, 1995, Comput. Networks ISDN Syst.

[11] Kristian J. Hammond, et al. FAQ finder: a case-based approach to knowledge navigation, 1995, Proceedings of the 11th Conference on Artificial Intelligence for Applications.