Comparison of Latent Semantic Analysis and Probabilistic Latent Semantic Analysis for Documents Clustering

In this paper we compare the usefulness of statistical dimensionality reduction techniques for improving the clustering of documents in Polish. We start with partitional and agglomerative algorithms applied to the Vector Space Model. We then investigate two transformations: Latent Semantic Analysis and Probabilistic Latent Semantic Analysis. The results show an advantage of Latent Semantic Analysis over the probabilistic model. We also analyse the time and memory consumption of these transformations and present runtime details for an IBM BladeCenter HS21 machine.
