Internet Categorization and Search: A Self-Organizing Approach

Abstract: The problems of information overload and vocabulary differences have become more pressing as Internet services have grown in popularity. The main information retrieval mechanisms offered by the prevailing Internet WWW software rely on either keyword search (e.g., the Lycos server at CMU, the Yahoo server at Stanford) or hypertext browsing (e.g., Mosaic and Netscape). This research aims to provide an alternative, concept-based categorization and search capability for WWW servers, built on selected machine learning algorithms. Our proposed approach, which is grounded in automatic textual analysis of Internet documents (homepages), attempts to address the Internet search problem by first categorizing the content of those documents. We report results from our recent testing of a multilayered neural network clustering algorithm that employs the Kohonen self-organizing feature map to categorize (classify) Internet homepages according to their content. The resulting category hierarchies could serve to partition the vast range of Internet services into subject-specific categories and databases, and so improve Internet keyword searching and/or browsing.
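
To make the idea of categorizing documents with a Kohonen self-organizing feature map more concrete, the following is a minimal single-layer sketch in Python. It is not the multilayered algorithm tested in this work. The function names (`term_vector`, `train_som`, `best_matching_unit`), the binary term-presence encoding, the linear decay schedules, and all parameter defaults are assumptions chosen for illustration only.

```python
import math
import random


def term_vector(text, vocabulary):
    """Encode a document as 1.0/0.0 flags for each vocabulary term (assumed encoding)."""
    words = set(text.lower().split())
    return [1.0 if term in words else 0.0 for term in vocabulary]


def distance(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def best_matching_unit(weights, x):
    """Return the grid position of the node whose weight vector is closest to x."""
    return min(weights, key=lambda node: distance(weights[node], x))


def train_som(vectors, rows, cols, epochs=50, lr0=0.5, seed=0):
    """Train a small rows x cols self-organizing map on equal-length vectors."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    # One weight vector per map node, initialised with random values in [0, 1).
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    radius0 = max(rows, cols) / 2.0
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                     # learning rate decays linearly
        radius = max(radius0 * (1.0 - frac), 0.5)   # neighbourhood shrinks over time
        order = list(vectors)
        rng.shuffle(order)
        for x in order:
            bmu = best_matching_unit(weights, x)
            for node, w in weights.items():
                grid_dist = distance(node, bmu)
                if grid_dist <= radius:
                    # Nodes nearer the winner on the grid move more strongly toward x.
                    influence = math.exp(-(grid_dist ** 2) / (2 * radius ** 2))
                    for i in range(dim):
                        w[i] += lr * influence * (x[i] - w[i])
    return weights
```

After training, calling `best_matching_unit(weights, term_vector(page_text, vocabulary))` assigns a page to a map node, which can be read as its category. Documents with similar term vectors tend to land on the same or neighbouring nodes, although a toy map like this gives no guarantee of clean separation.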
