Multi-LDA hybrid topic model with boosting strategy and its application in text classification

Topic modeling, especially Latent Dirichlet Allocation (LDA), is an effective approach to feature selection and dimension reduction in text categorization tasks. Unlike the traditional Vector Space Model, LDA can easily overcome the curse of dimensionality and the feature sparsity problem. Mapping from word space to topic space brings further benefits, but at the same time determining the model parameters becomes a new problem. This article proposes a novel classification algorithm that combines models with different parameters via a boosting strategy. Naïve Bayes and Support Vector Machine are employed as weak classifiers, and a weighting method is proposed to improve accuracy by integrating the weak classifiers into a strong classifier in a more rational way. Experimental results show that our method performs well in both accuracy and generalization.
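The combination step can be illustrated with a minimal sketch. The abstract does not give the weighting formula, so the example below assumes the standard AdaBoost classifier weight as a stand-in; the classifier names, error rates, and predicted labels are hypothetical values chosen for illustration, not results from the article.

```python
import math
from collections import defaultdict


def classifier_weight(error_rate, eps=1e-10):
    """Standard AdaBoost weight (an assumption, not the article's formula).

    A lower error rate yields a larger vote; an error rate of 0.5 yields 0.
    """
    err = min(max(error_rate, eps), 1 - eps)  # avoid log(0) and division by 0
    return 0.5 * math.log((1 - err) / err)


def weighted_vote(predictions, weights):
    """Sum the weights behind each predicted label and return the top label."""
    scores = defaultdict(float)
    for label, weight in zip(predictions, weights):
        scores[label] += weight
    return max(scores, key=scores.get)


# Hypothetical validation error rates for three weak classifiers, each
# trained on topic features from an LDA model with a different topic count.
error_rates = {"nb_k20": 0.30, "svm_k50": 0.05, "svm_k100": 0.20}
weights = {name: classifier_weight(err) for name, err in error_rates.items()}

# Hypothetical labels predicted by each weak classifier for one document.
predictions = {"nb_k20": "sports", "svm_k50": "politics", "svm_k100": "sports"}

names = list(predictions)
label = weighted_vote(
    [predictions[n] for n in names],
    [weights[n] for n in names],
)
print(label)
```

With these illustrative numbers, the single most accurate classifier outweighs the two less accurate ones, so the weighted result differs from a simple majority vote. This is the kind of behavior a weighted integration is meant to allow.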
