A MapReduce based distributed LSI

Latent Semantic Indexing (LSI) is a widely used text mining technique because it handles synonymy and polysemy effectively when the term-document matrix is of manageable size. However, LSI is computationally intensive, especially on large-scale data. An effective solution is to increase the computational power available to LSI by using multiple computing nodes. In this paper we propose a novel MapReduce-based distributed LSI built on the Hadoop distributed computing architecture: the K-means algorithm is implemented on Hadoop to cluster the documents, and LSI is then applied to the clustered results. We evaluate the performance of the proposed MapReduce-based LSI and compare it with standalone LSI. The results show a substantial improvement in LSI's speed.
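To make the clustering stage concrete, the following is a minimal single-process sketch of how one K-means iteration can be phrased as a map step and a reduce step. It is not the authors' Hadoop implementation; the function names (`kmeans_map`, `kmeans_reduce`, `kmeans_iteration`, `kmeans`), the use of squared Euclidean distance, and the dense list-of-floats document vectors are all assumptions made for illustration. On a real cluster the shuffle would be performed by the framework rather than by an in-memory dictionary.

```python
from collections import defaultdict


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_map(doc_vector, centroids):
    """Map step (hypothetical): emit (nearest centroid index, (vector, 1))."""
    nearest = min(
        range(len(centroids)),
        key=lambda i: squared_distance(doc_vector, centroids[i]),
    )
    return nearest, (doc_vector, 1)


def kmeans_reduce(partials):
    """Reduce step (hypothetical): average the vectors assigned to one centroid."""
    dim = len(partials[0][0])
    total = [0.0] * dim
    count = 0
    for vector, n in partials:
        for j in range(dim):
            total[j] += vector[j]
        count += n
    return [t / count for t in total]


def kmeans_iteration(docs, centroids):
    """One round: map every document, group by key, reduce each group."""
    groups = defaultdict(list)
    for doc in docs:
        key, value = kmeans_map(doc, centroids)
        groups[key].append(value)
    # A centroid with no assigned documents is kept unchanged.
    return [
        kmeans_reduce(groups[i]) if i in groups else list(centroids[i])
        for i in range(len(centroids))
    ]


def kmeans(docs, centroids, max_iterations=20):
    """Repeat rounds until the centroids stop changing or the limit is hit."""
    for _ in range(max_iterations):
        updated = kmeans_iteration(docs, centroids)
        if updated == centroids:
            break
        centroids = updated
    return centroids
```

Once the documents are grouped this way, LSI would be applied to each cluster's smaller term-document matrix instead of to the full collection, which is where the reduction in computation is expected to come from.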
