On principal component analysis, cosine and Euclidean measures in information retrieval

Abstract: Clustering groups document objects that are represented as vectors. A high-dimensional vector space can make these methods hard to apply, so we reduced the vector space with principal component analysis (PCA). The conventional cosine measure is not the only choice with PCA, which involves mean-correction of the data. Since mean-correction changes the location of the origin, the angles between the document vectors also change. To avoid this, we used a connection between the cosine measure and the Euclidean distance in association with PCA, and based the search on the latter. We applied single linkage, complete linkage and Ward clustering to Finnish documents, using their relevance assessments as a new feature. After the data were normalized, PCA was run and the relevant documents were clustered.
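The connection between the two measures can be illustrated with a small sketch. It assumes that normalization means scaling each document vector to unit length, in which case the standard identity ||x - y||^2 = 2(1 - cos(x, y)) holds. The toy vectors and helper names below are hypothetical and are not taken from the paper. The sketch also checks that mean-correction, being a translation, leaves Euclidean distances unchanged while it alters the cosine.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def unit(u):
    # Scale a vector to unit length (assumed normalization step).
    n = norm(u)
    return [a / n for a in u]

def cosine(u, v):
    return dot(u, v) / (norm(u) * norm(v))

def euclidean(u, v):
    return norm([a - b for a, b in zip(u, v)])

def mean_correct(docs):
    # Subtract the per-dimension mean, as PCA does before rotating the data.
    n = len(docs)
    mean = [sum(col) / n for col in zip(*docs)]
    return [[a - m for a, m in zip(d, mean)] for d in docs]

# Hypothetical toy term-weight vectors, normalized to unit length.
docs = [unit(d) for d in [[3.0, 1.0, 0.0], [2.0, 2.0, 1.0], [0.0, 1.0, 4.0]]]
centered = mean_correct(docs)

x, y = docs[0], docs[1]
cx, cy = centered[0], centered[1]

# For unit vectors: squared Euclidean distance equals 2 * (1 - cosine).
identity_gap = abs(euclidean(x, y) ** 2 - 2 * (1 - cosine(x, y)))

# Mean-correction shifts the origin: distances survive, angles do not.
distance_shift = abs(euclidean(cx, cy) - euclidean(x, y))
cosine_shift = abs(cosine(cx, cy) - cosine(x, y))

print(identity_gap, distance_shift, cosine_shift)
```

On unit-length vectors, ranking by Euclidean distance therefore gives the same order as ranking by cosine, and that order is not disturbed by the shift of origin. Note that the rotation in PCA preserves distances exactly only when all components are kept; after truncation to fewer components the distances are approximations.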
