The World-Wide Web: quagmire or gold mine?

Skeptics believe the Web is too unstructured for Web mining to succeed. Indeed, data mining has been applied traditionally to databases, yet much of the information on the Web lies buried in documents designed for human consumption such as home pages or product catalogs. Furthermore, much of the information on the Web is presented in natural-language text with no machine-readable semantics; HTML annotations structure the display of Web pages, but provide little insight into their content. Some have advocated transforming the Web into a massive layered database to facilitate data mining [12], but the Web is too dynamic and chaotic to be tamed in this manner. Others have attempted to hand code site-specific “wrappers” that facilitate the extraction of information from individual Web resources (e.g., [8]). Hand coding is convenient but cannot keep up with the explosive growth of the Web. As an alternative, this article argues for the structured Web hypothesis: Information on the Web is sufficiently structured to facilitate effective Web mining. Examples of Web structure include linguistic and typographic conventions, HTML annotations (e.g., <title>), classes of semi-structured documents (e.g., product catalogs), Web indices and directories, and much more. To support the structured Web hypothesis, this article will survey preliminary Web mining successes and suggest directions for future work. Web mining may be organized into the following subtasks: