A neighborhood-based approach for clustering of linked document collections

This paper addresses the problem of automatically structuring linked document collections by means of clustering. In contrast to traditional clustering, we study the clustering problem in the light of the link structure information available for the data set (e.g., hyperlinks among web documents or co-authorship among bibliographic data entries). Our approach is based on iterative relaxation of cluster assignments and can be built on top of any clustering algorithm. This technique yields higher cluster purity and better overall accuracy, and makes self-organization more robust.
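To make the idea of iterative relaxation concrete, the following is a minimal sketch and not the paper's actual update rule. It assumes that an arbitrary base clustering algorithm has already produced initial labels, and that relaxation is a simple majority vote over each document's linked neighbors, with the document's own label counted as one vote. The function name `relax_labels`, the parameter `max_iters`, and the toy link graph are hypothetical and chosen only for illustration.

```python
from collections import Counter


def relax_labels(initial_labels, neighbors, max_iters=10):
    """Re-assign each document to the majority label in its link neighborhood.

    initial_labels: dict mapping document id -> cluster label from any base clusterer.
    neighbors: dict mapping document id -> iterable of linked document ids.
    Assumption: majority voting stands in for the relaxation step.
    """
    labels = dict(initial_labels)
    for _ in range(max_iters):
        updated = {}
        for doc, label in labels.items():
            votes = Counter({label: 1})  # the document's own current label counts once
            for nb in neighbors.get(doc, ()):
                votes[labels[nb]] += 1
            best = max(votes.values())
            if votes[label] == best:
                # Keep the current label when it is (tied for) the majority.
                updated[doc] = label
            else:
                # Break ties among other labels deterministically.
                updated[doc] = min(l for l, c in votes.items() if c == best)
        if updated == labels:  # assignments are stable, stop early
            break
        labels = updated
    return labels


# Toy example: document "d" is linked only to cluster-0 documents,
# but the (hypothetical) base clusterer placed it in cluster 1.
links = {
    "a": ["b", "c", "d"],
    "b": ["a", "c", "d"],
    "c": ["a", "b", "d"],
    "d": ["a", "b", "c"],
    "e": ["f"],
    "f": ["e"],
}
base = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
result = relax_labels(base, links)
```

In this toy graph, "d" is pulled into the cluster of its neighbors while "e" and "f" keep their label. Because the sketch updates all documents synchronously, assignments can oscillate on some graphs, which is why the number of iterations is capped.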
