Categorizing Web pages on the subject of neural networks

Most existing techniques for page classification on the World Wide Web are based on text-only analysis. Recently, several hypertext clustering algorithms have been proposed. These provide promising results when the clustering is based on combined term-similarity and hyperlink-similarity measures. However, both the traditional and the advanced techniques require improvements in the term-vector (word-vector) representation of Web pages, especially when applied to Web collections dealing with one or a few particular topics. In this work we introduce an autonomous agent for hypertext classification, implemented in Java. This paper describes the part of the development related to text-only analysis, including a modification of a well-known rule for information retrieval and the use of word correlation. The algorithm has been employed in clustering Web pages related to the subject of neural networks. The results are useful in arriving at an efficient term-vector representation that allows rapid and appropriate clustering based on the content of on-line documents. The term vectors derived using this algorithm have been classified using a modified adaptive resonance theory (ART) algorithm, an unsupervised learning method from artificial neural networks that has been shown to provide accurate clustering. Examples of the results are presented in the paper, suggesting several benefits of using the methods.

© 1998 Academic Press
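The pipeline summarized above (term vectors, then unsupervised clustering) can be illustrated with a minimal sketch. The abstract does not specify the modified retrieval rule or the modified ART algorithm, so the code below is an assumption throughout: it uses plain tf-idf weighting and a simplified ART-style procedure in which a page joins the most similar existing cluster only if the cosine similarity passes a vigilance threshold, and otherwise starts a new cluster. The function names, the `vigilance` and `rate` parameters, and the sample pages are all hypothetical, and Python is used here for brevity even though the agent itself is described as a Java implementation.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, L2-normalised tf-idf vectors) for tokenised docs.

    Standard tf-idf, NOT the modified rule described in the paper.
    """
    vocab = sorted({term for doc in docs for term in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vec = [tf[t] * math.log(n_docs / df[t]) for t in vocab]
        norm = math.sqrt(sum(w * w for w in vec))
        vectors.append([w / norm for w in vec] if norm > 0 else vec)
    return vocab, vectors


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def art_like_clusters(vectors, vigilance=0.2, rate=0.5):
    """Simplified ART-style clustering (an assumption, not the paper's variant).

    Each vector is compared with every cluster prototype. If the best match
    passes the vigilance test, that prototype is moved toward the vector;
    otherwise the vector becomes the prototype of a new cluster.
    """
    prototypes, labels = [], []
    for vec in vectors:
        best, best_sim = None, -1.0
        for i, proto in enumerate(prototypes):
            sim = cosine(vec, proto)
            if sim > best_sim:
                best, best_sim = i, sim
        if best is not None and best_sim >= vigilance:
            prototypes[best] = [
                (1 - rate) * p + rate * x for p, x in zip(prototypes[best], vec)
            ]
            labels.append(best)
        else:
            prototypes.append(list(vec))
            labels.append(len(prototypes) - 1)
    return labels


# Hypothetical, already-tokenised page texts on two neural-network subtopics.
pages = [
    "backpropagation training of multilayer perceptron networks".split(),
    "multilayer perceptron training with backpropagation and momentum".split(),
    "adaptive resonance theory clustering with vigilance".split(),
    "vigilance parameter in adaptive resonance theory".split(),
]
vocab, vectors = tfidf_vectors(pages)
labels = art_like_clusters(vectors, vigilance=0.2)
print(labels)  # [0, 0, 1, 1]
```

Raising `vigilance` makes the sketch split the pages into more, tighter clusters, while lowering it merges them. This mirrors the role the vigilance parameter plays in ART generally, although the paper's modified algorithm may treat it differently.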
