A term-based algorithm for hierarchical clustering of Web documents

In this paper we introduce the class hierarchy construction algorithm (CHCA), a novel method for creating hierarchical clusterings of Web documents. Unlike most clustering methods, CHCA operates on nominal data (the words occurring in each document). It also differs from other hierarchical clustering techniques in that it uses the object-oriented concept of inheritance to define the parent/child relationship between clusters. We have developed a prototype system that uses CHCA to build cluster hierarchies from the results returned by conventional Web search engines. Without any guidance, CHCA creates term-based clusters from the contents of the retrieved pages and assigns each page to a cluster; the clusters correspond to topics and sub-topics in the investigated domain. We compare the performance of our system with that of a similar Web search clustering system (Vivisimo).
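
The inheritance idea described above can be illustrated with a minimal sketch. This is not the CHCA procedure itself, which the abstract does not specify; it only assumes that each cluster is defined by a set of terms, that a child cluster inherits all of its parent's terms and adds its own, and that a page belongs to the most specific cluster whose terms it contains. The names `Cluster` and `assign`, the tie-breaking rule, and the example terms are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """A term-based cluster; own_terms are the terms it adds to its parent's."""
    name: str
    own_terms: frozenset
    parent: "Cluster | None" = None
    children: list = field(default_factory=list)
    pages: list = field(default_factory=list)

    def __post_init__(self):
        # Register this cluster with its parent so the hierarchy can be walked.
        if self.parent is not None:
            self.parent.children.append(self)

    def terms(self):
        """Own terms plus every term inherited from the ancestors."""
        inherited = self.parent.terms() if self.parent else frozenset()
        return inherited | self.own_terms


def assign(page_id, page_terms, root):
    """Place a page in the deepest cluster whose full term set it contains."""
    node = root
    while True:
        matches = [c for c in node.children if c.terms() <= page_terms]
        if not matches:
            break
        # Assumed tie-break (not from the paper): prefer the child with most terms.
        node = max(matches, key=lambda c: len(c.terms()))
    node.pages.append(page_id)
    return node


# Hypothetical hierarchy for the query "jaguar".
root = Cluster("root", frozenset())
jaguar = Cluster("jaguar", frozenset({"jaguar"}), root)
car = Cluster("jaguar/car", frozenset({"car"}), jaguar)
cat = Cluster("jaguar/cat", frozenset({"cat"}), jaguar)

p1 = assign("p1", {"jaguar", "car", "dealer"}, root)
p2 = assign("p2", {"jaguar", "habitat"}, root)
p3 = assign("p3", {"weather"}, root)
```

In this sketch a page that matches no sub-topic stays in the nearest matching ancestor, so general pages remain at the topic level while more specific pages descend to sub-topics.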
