Vector Space Models for Search and Cluster Mining

This chapter consists of two parts: a review of search and cluster mining algorithms based on vector space modeling, followed by a description of a prototype search and cluster mining system. In the review, we consider Latent Semantic Indexing (LSI), a method based on the Singular Value Decomposition (SVD) of the document-attribute matrix, and Principal Component Analysis (PCA) of the document vector covariance matrix. In the second part, we present novel techniques for mining major and minor clusters from massive databases, based on enhancements of LSI and PCA, and for automatically labeling clusters based on their document contents. Most mining systems are designed to find major clusters and often fail to report information on smaller, minor clusters. Minor cluster identification is important in many business applications, such as credit card fraud detection, profile analysis, and scientific data analysis. Another novel feature of our method is the recognition and preservation of naturally occurring overlaps among clusters; cluster overlap analysis is important for multiperspective analysis of databases. Results from implementation studies with a prototype system using over 100,000 news articles demonstrate the effectiveness of the search and clustering engines.
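As a rough illustration of the LSI idea reviewed in the first part, the sketch below ranks documents against a query in a reduced space obtained from the SVD of a small document-attribute matrix. It is a minimal example and not the chapter's enhanced algorithm or its prototype system. The function name `lsi_similarities`, the toy matrix `A`, the query vector, and the choice `k = 2` are all assumptions made for illustration.

```python
import numpy as np

def lsi_similarities(A, query, k):
    """Cosine similarity between a query and each document (row of A)
    after projecting both onto the top-k right singular vectors of A."""
    # Thin SVD of the document-attribute matrix: A = U @ diag(s) @ Vt
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    Vk = Vt[:k].T                # attributes x k basis of the reduced space
    docs_k = A @ Vk              # document coordinates in the reduced space
    q_k = query @ Vk             # query coordinates in the same space
    norms = np.linalg.norm(docs_k, axis=1) * np.linalg.norm(q_k)
    # Guard against division by zero for vectors that project to the origin
    return (docs_k @ q_k) / np.where(norms == 0, 1.0, norms)

# Hypothetical toy data: 4 documents (rows) x 4 attributes (columns).
# Attributes 0 and 1 co-occur in document 0; attributes 2 and 3 in document 2.
A = np.array([
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
])
query = np.array([1.0, 0.0, 0.0, 0.0])  # query on attribute 0 only

sims = lsi_similarities(A, query, k=2)
ranking = np.argsort(-sims)  # document indices, most similar first
```

In this toy case, document 1 does not contain attribute 0 yet scores highly, because attributes 0 and 1 co-occur in document 0 and collapse onto the same latent direction. A PCA-based variant would instead center the document vectors and use the eigenvectors of their covariance matrix.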
