Stochastic Variational Inference-Based Parallel and Online Supervised Topic Model for Large-Scale Text Processing

Topic modeling is a mainstream and effective technique for handling text data, with wide applications in text analysis, natural language processing, personalized recommendation, computer vision, etc. Among known topic models, supervised Latent Dirichlet Allocation (sLDA) is a popular and competitive supervised topic model. However, as datasets grow, sLDA training becomes increasingly inefficient and time-consuming, which restricts its range of applications. To address this problem, this study proposes a parallel online sLDA, named PO-sLDA (Parallel and Online sLDA). It uses stochastic variational inference as the learning method to make training faster and more efficient, and a parallel computing mechanism implemented via the MapReduce framework to support cloud computing and big data processing. The online training capability of PO-sLDA broadens the scope of the approach, making it suitable for real-life applications with strict real-time demands. Validation on two datasets of different sizes shows that the proposed approach achieves accuracy comparable to that of sLDA while efficiently accelerating training. Moreover, its good convergence and online training capability make it attractive for large-scale text data analysis and processing.
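To make the combination of stochastic variational inference and MapReduce more concrete, the following is a minimal sketch of one mini-batch update of the topic-word variational parameters, written in the style of the standard online LDA update. It is an illustration under stated assumptions, not the PO-sLDA implementation: the function names, the data layout, and the hyperparameter values (`tau0`, `kappa`, `eta`) are hypothetical, the per-word topic responsibilities `phi` are assumed to come from a local inference step that is omitted here, and the update of the sLDA response (regression) parameters is not covered.

```python
from functools import reduce


def step_size(t, tau0=1.0, kappa=0.7):
    """Decaying step size rho_t = (tau0 + t)^(-kappa); kappa in (0.5, 1]."""
    return (tau0 + t) ** (-kappa)


def map_doc(doc, phi):
    """Map step (assumed layout): expected topic-word counts for one document.

    doc: dict word_id -> count
    phi: dict word_id -> list of K topic responsibilities for that word
    """
    return {w: [n * p for p in phi[w]] for w, n in doc.items()}


def reduce_stats(a, b):
    """Reduce step: element-wise sum of two partial statistics tables."""
    out = {w: list(v) for w, v in a.items()}
    for w, v in b.items():
        if w in out:
            out[w] = [x + y for x, y in zip(out[w], v)]
        else:
            out[w] = list(v)
    return out


def svi_update(lam, stats, eta, corpus_size, batch_size, rho):
    """Blend the old parameters with the mini-batch estimate.

    lam: K x V nested list of topic-word variational parameters.
    The mini-batch statistics are rescaled as if the whole corpus
    looked like this batch, then mixed in with weight rho.
    """
    scale = corpus_size / batch_size
    num_topics, vocab_size = len(lam), len(lam[0])
    new_lam = []
    for k in range(num_topics):
        row = []
        for w in range(vocab_size):
            s = stats.get(w, [0.0] * num_topics)[k]
            lam_hat = eta + scale * s
            row.append((1.0 - rho) * lam[k][w] + rho * lam_hat)
        new_lam.append(row)
    return new_lam


if __name__ == "__main__":
    # Toy mini-batch: two documents, two topics, vocabulary of two words.
    batch = [({0: 2}, {0: [1.0, 0.0]}), ({1: 1}, {1: [0.5, 0.5]})]
    stats = reduce(reduce_stats, (map_doc(d, p) for d, p in batch))
    lam = [[1.0, 1.0], [1.0, 1.0]]
    lam = svi_update(lam, stats, eta=0.1, corpus_size=4,
                     batch_size=len(batch), rho=step_size(1))
    print(lam)
```

In a real MapReduce job the `map_doc` calls would run on separate workers and `reduce_stats` would combine their outputs; here `functools.reduce` stands in for that aggregation on a single machine. Because each update touches only one mini-batch, the same loop can consume documents as they arrive, which is what makes online training possible.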
