Knowledge discovery in the Internet

With the rapid expansion of the World Wide Web, the need for efficient data retrieval strategies is growing and will continue to grow. Unfortunately, classical information retrieval techniques, developed for well-organized collections of textual data, do not seem able to cope with the diversity and amount of information available throughout the Internet. This paper presents some of the newest approaches to information retrieval in large, unstructured hypertext spaces such as the WWW, approaches that focus more on latent information embedded in hyperlinks and document structure than on actual understanding of the textual content of Web pages. These techniques, which mark new trends and prospects for Internet technology, have recently been given the name "Web mining", as they are in fact examples of unsupervised machine learning similar to data mining and text mining. Here we discuss methods belonging to the following three groups: link topology analysis, statistical text analysis, and query languages and systems design.
