Study of the Parallel Techniques for Dimensionality Reduction and Its Impact on Performance of the Text Processing Algorithms

The presented algorithms employ the Vector Space Model (VSM) and its enhancements, such as TF-IDF (Term Frequency-Inverse Document Frequency). The vector space model suffers from the curse of dimensionality, so various dimensionality reduction algorithms are used. This paper deals with two of the most common ones: Latent Semantic Indexing (LSI) and Random Projection (RP). The size of a document corpus turns out to have a substantial impact on the processing time, so the authors introduce GPU-based acceleration of these techniques. A dedicated test set-up was created and a series of experiments was conducted, which revealed important properties of the algorithms and their accuracy. The experiments show that random projection outperforms LSI in terms of computing speed, at the expense of result quality.
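
The two reduction techniques named in the abstract can be illustrated with a minimal CPU-only sketch in NumPy. This is not the authors' GPU implementation: the helper names (`tfidf`, `lsi`, `random_projection`), the toy corpus, the plain `log(N/df)` weighting, and the Gaussian projection matrix are all assumptions chosen for illustration. LSI is sketched as a truncated SVD of the TF-IDF matrix, and RP as multiplication by a random matrix scaled by `1/sqrt(k)`.

```python
import numpy as np

def tfidf(docs):
    """Build a (documents x terms) TF-IDF matrix from whitespace-tokenized text."""
    vocab = sorted({w for d in docs for w in d.split()})
    index = {w: j for j, w in enumerate(vocab)}
    tf = np.zeros((len(docs), len(vocab)))
    for i, d in enumerate(docs):
        for w in d.split():
            tf[i, index[w]] += 1.0
    df = (tf > 0).sum(axis=0)          # document frequency of each term
    idf = np.log(len(docs) / df)       # assumed weighting: plain log(N / df)
    return tf * idf, vocab

def lsi(X, k):
    """LSI sketch: project documents onto the top-k right singular vectors."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] * s[:k]            # equals X @ Vt[:k].T

def random_projection(X, k, seed=0):
    """RP sketch: multiply by a Gaussian matrix scaled by 1/sqrt(k)."""
    rng = np.random.default_rng(seed)
    R = rng.standard_normal((X.shape[1], k)) / np.sqrt(k)
    return X @ R

# Hypothetical toy corpus, for illustration only.
docs = [
    "gpu accelerates matrix computation",
    "random projection reduces dimensionality",
    "latent semantic indexing uses matrix decomposition",
    "text documents form a high dimensional vector space",
]
X, vocab = tfidf(docs)
X_lsi = lsi(X, k=2)
X_rp = random_projection(X, k=2)
print(X.shape, X_lsi.shape, X_rp.shape)
```

The sketch reflects the trade-off stated in the abstract: RP needs only one matrix product, while LSI requires a decomposition of the whole term matrix, which is the more expensive step on large corpora.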
