Regularized Latent Semantic Indexing

Topic modeling can boost the performance of information retrieval, but its real-world application is limited by scalability issues. Scaling to larger document collections via parallelization is an active area of research, but most solutions require drastic steps, such as vastly reducing the input vocabulary. We introduce Regularized Latent Semantic Indexing (RLSI), a new method designed for parallelization. It is as effective as existing topic models and scales to larger datasets without reducing the input vocabulary. RLSI formalizes topic modeling as the minimization of a quadratic loss function regularized by the ℓ₁ and/or ℓ₂ norm. This formulation allows the learning process to be decomposed into multiple sub-optimization problems that can be solved in parallel, for example via MapReduce. In particular, we propose adopting the ℓ₁ norm on topics and the ℓ₂ norm on document representations, yielding a model with compact, readable topics that is useful for retrieval. Relevance ranking experiments on three TREC datasets show that RLSI performs better than LSI, PLSI, and LDA, and the improvements are sometimes statistically significant. Experiments on a web dataset containing about 1.6 million documents and 7 million terms demonstrate a similar performance boost on a larger corpus and vocabulary than in previous studies.
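The formulation described above can be made concrete as follows (the notation here is ours, chosen to match common matrix-factorization conventions: D is the M × N term-document matrix, U the M × K term-topic matrix whose columns are topics, V the K × N matrix of document representations, and λ₁, λ₂ the regularization weights):

    min_{U,V}  ‖D − UV‖²_F  +  λ₁ Σᵢⱼ |Uᵢⱼ|  +  λ₂ ‖V‖²_F

With U fixed, each column of V is an independent ridge regression on the corresponding document; with V fixed, each row of U is an independent lasso over the corresponding term. It is exactly this row/column independence that makes the learning decomposable across machines. The sketch below illustrates the alternation in NumPy under these assumptions; it is a minimal single-machine illustration, not the paper's implementation, and the function name, default hyperparameters, and update order are ours.

    import numpy as np

    def rlsi(D, K, lam1=0.1, lam2=0.1, iters=20, seed=0):
        """Alternating minimization for an RLSI-style objective:
            min_{U,V} ||D - U V||_F^2 + lam1 * sum_ij |U_ij| + lam2 * ||V||_F^2
        D (M x N): term-document matrix; U (M x K): topics, l1-regularized
        and hence sparse; V (K x N): document representations, l2-regularized."""
        rng = np.random.default_rng(seed)
        M, N = D.shape
        U = rng.random((M, K))
        V = rng.random((K, N))
        for _ in range(iters):
            # V-update: ridge regression with a K x K closed-form solve.
            # Each column of V depends only on the matching column of D,
            # so this step shards trivially over documents.
            V = np.linalg.solve(U.T @ U + lam2 * np.eye(K), U.T @ D)
            # U-update: coordinate descent on the lasso subproblem.
            # Each row of U depends only on the matching row of D,
            # so this step shards trivially over terms.
            S = V @ V.T   # K x K Gram matrix, shared by all rows of U
            R = D @ V.T   # M x K correlations, one row per term
            for k in range(K):
                # partial residual for coordinate k of every row at once
                r = R[:, k] - U @ S[:, k] + U[:, k] * S[k, k]
                # soft-thresholding: the standard lasso coordinate update
                U[:, k] = np.sign(r) * np.maximum(np.abs(r) - lam1 / 2, 0.0) / (S[k, k] + 1e-12)
        return U, V

In a MapReduce setting, the V solve maps over document columns (all sharing the small K × K matrix UᵀU + λ₂I) and the U update maps over term rows (all sharing VVᵀ and DVᵀ), which is the decomposition into parallel sub-optimization problems that the abstract describes.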
