Incremental learning with partial-supervision based on hierarchical Dirichlet process and the application for document classification

Partial supervision: use available knowledge to guide the model learning process for better accuracy. Incremental learning: adjust the parameters and model structure to the latest information. Introduce the idea of granular computing to achieve better accuracy and detect newly emerging categories.

Hierarchical Dirichlet process (HDP) is an unsupervised method that has been widely used for topic extraction and document clustering. One advantage of HDP is that it has an inherent mechanism for determining the total number of clusters/topics. However, HDP has three weaknesses: (1) it has no mechanism for using known labels or incorporating expert knowledge into the learning procedure, which prevents users from directing the learning and makes the final results hard to interpret; (2) it cannot detect the categories expected by applications without expert guidance; (3) it does not automatically adjust the model parameters and structure in a changing environment. To address these weaknesses, this paper proposes an incremental learning method with partial supervision for HDP, which enables the topic model (initially guided by partial knowledge) to adapt incrementally to the latest available information. An important contribution of this work is the application of granular computing to HDP for partial supervision and incremental learning, which results in a more controllable and interpretable model structure. These enhancements provide a more flexible approach in which expert guidance steers the model learning, and hence result in better prediction accuracy and interpretability.
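The "inherent mechanism" for determining the number of clusters can be illustrated with the Chinese restaurant process (CRP), the standard sampling view of a single Dirichlet process. The sketch below is only an illustration under stated assumptions: it is not the method proposed in this paper, it covers a single-level process rather than the full HDP, and the function name `crp_assignments` and the concentration parameter `alpha` are hypothetical names chosen for this example.

```python
import random


def crp_assignments(n_items, alpha, seed=0):
    """Sample cluster assignments from a Chinese restaurant process.

    Hypothetical helper for illustration only. `alpha` (> 0) is the
    concentration parameter: larger values make new clusters more likely.
    """
    rng = random.Random(seed)
    counts = []       # counts[k] = number of items currently in cluster k
    assignments = []  # assignments[i] = cluster index of item i
    for _ in range(n_items):
        # An existing cluster k is chosen with weight counts[k];
        # a brand-new cluster is chosen with weight alpha.
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)   # open a new cluster
        else:
            counts[k] += 1     # join an existing cluster
        assignments.append(k)
    return assignments, counts


if __name__ == "__main__":
    assignments, counts = crp_assignments(n_items=200, alpha=1.0)
    print("number of clusters:", len(counts))
    print("cluster sizes:", counts)
```

The number of clusters is never fixed in advance; it grows as items arrive, at a rate governed by `alpha`. The same property is what allows a model of this family to open a new topic when newly arriving documents do not fit the existing ones.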
