Comparison of Algorithms for Web Document Clustering Using Graph Representations of Data

In this paper we compare the performance of several popular clustering algorithms, including k-means, fuzzy c-means, hierarchical agglomerative clustering, and graph partitioning. The novelty of this work is that the objects to be clustered are represented by graphs rather than by the usual numeric feature vectors. We apply these techniques to web document clustering: web documents are structured information sources and are therefore well suited to modeling by graphs. We examine the performance of each clustering algorithm when the web documents are represented both as graphs and as vectors, which allows us to investigate the applicability of each algorithm to the problem of web document clustering.
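To make the idea of clustering graphs rather than vectors concrete, the following is a minimal sketch, not the method evaluated in the paper. It assumes one simple graph model (one node per unique term, a directed edge between terms that appear next to each other) and one well-known distance based on the maximum common subgraph, d(G1, G2) = 1 - |mcs(G1, G2)| / max(|G1|, |G2|). The function names `document_to_graph`, `graph_size`, and `mcs_distance` are hypothetical, and measuring graph size as nodes plus edges is an assumption made for illustration.

```python
import re


def document_to_graph(text):
    """Build a simple word graph (assumed model, for illustration only).

    Nodes are the unique terms in the text; a directed edge (a, b) exists
    whenever term b immediately follows term a.
    """
    terms = re.findall(r"[a-z0-9]+", text.lower())
    nodes = set(terms)
    edges = set(zip(terms, terms[1:]))
    return nodes, edges


def graph_size(graph):
    """Size of a graph, taken here as nodes plus edges (an assumption)."""
    nodes, edges = graph
    return len(nodes) + len(edges)


def mcs_distance(g1, g2):
    """Distance 1 - |mcs(g1, g2)| / max(|g1|, |g2|).

    Because every node label is unique within a graph, the maximum common
    subgraph is simply the intersection of the node sets and edge sets.
    """
    (nodes1, edges1), (nodes2, edges2) = g1, g2
    common = len(nodes1 & nodes2) + len(edges1 & edges2)
    largest = max(graph_size(g1), graph_size(g2))
    if largest == 0:
        return 0.0  # two empty graphs are treated as identical
    return 1.0 - common / largest


if __name__ == "__main__":
    a = document_to_graph("web document clustering")
    b = document_to_graph("web document analysis")
    print(mcs_distance(a, b))
```

A distance of this kind can stand in for the Euclidean distance in a distance-based algorithm such as hierarchical agglomerative clustering. Centroid-based methods such as k-means additionally need a graph-valued notion of a cluster center, which this sketch does not cover.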
