Data Clustering: From Documents to the Web

This chapter surveys clustering methods relevant to clustering document collections and, consequently, Web data. We start with classical methods of cluster analysis that appear relevant to clustering Web data. Graph clustering is also described, since its methods contribute significantly to clustering Web data. The use of artificial neural networks for clustering is covered for the same reason. Building on this material, the core of the chapter gives an overview of approaches to clustering in the Web environment. In particular, we focus on clustering Web search results, in which clustering search engines arrange the results into groups around a common theme. We conclude with some general considerations on the justification for so many clustering algorithms and their application in the Web environment.
