Learning Beyond Predefined Label Space via Bayesian Nonparametric Topic Modelling

In real-world machine learning applications, test data may contain meaningful new categories that were never seen in the labeled training data. To simultaneously recognize such new categories and assign the most appropriate labels to data that actually belong to known categories, existing models assume that the number of unknown new categories is pre-specified, even though this number is difficult to determine in advance. In this paper, we propose a Bayesian nonparametric topic model that infers this number automatically, building on the hierarchical Dirichlet process and the notion of latent Dirichlet allocation. Exact inference in our model is intractable, so we provide an efficient collapsed Gibbs sampling algorithm for approximate posterior inference. Extensive experiments on various text data sets show that: (a) compared with parametric approaches that use the pre-specified true number of new categories, the proposed nonparametric approach yields comparable performance; and (b) when the exact number of new categories is unavailable, i.e., the parametric approaches have only a rough idea about the new categories, our approach shows clear performance advantages.
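To make the central idea concrete, the sketch below illustrates how collapsed Gibbs sampling under a Dirichlet process prior lets the number of clusters grow with the data instead of being fixed in advance. This is a minimal, illustrative example only, not the paper's actual model: it samples a Chinese-restaurant-process mixture of Dirichlet-multinomials over bag-of-words documents, whereas the proposed method additionally conditions on the known category labels through a hierarchical Dirichlet process. All names and hyperparameter values here are assumptions.

```python
import numpy as np
from scipy.special import gammaln

def dp_mixture_gibbs(docs, vocab_size, alpha=1.0, beta=0.5, n_iters=100, seed=0):
    """Collapsed Gibbs sampler for a Dirichlet process mixture of
    Dirichlet-multinomials over bag-of-words documents (Chinese
    restaurant process representation). The number of clusters is
    inferred from the data rather than pre-specified."""
    rng = np.random.default_rng(seed)
    X = np.asarray(docs, dtype=float)       # (n_docs, vocab_size) word counts
    n_docs = X.shape[0]
    z = -np.ones(n_docs, dtype=int)         # cluster assignment per document
    counts = []                             # per-cluster accumulated word counts
    sizes = []                              # per-cluster number of documents

    def log_marginal(doc, wc):
        # log p(doc | cluster counts wc) under a symmetric Dirichlet(beta) prior,
        # dropping the multinomial coefficient (constant across clusters)
        return (gammaln(wc.sum() + beta * vocab_size)
                - gammaln(wc.sum() + doc.sum() + beta * vocab_size)
                + np.sum(gammaln(wc + doc + beta) - gammaln(wc + beta)))

    for _ in range(n_iters):
        for d in range(n_docs):
            if z[d] >= 0:                   # remove document d from its cluster
                k = z[d]
                counts[k] -= X[d]
                sizes[k] -= 1
                if sizes[k] == 0:           # drop empty clusters, relabel the rest
                    counts.pop(k); sizes.pop(k)
                    z[z > k] -= 1
                z[d] = -1
            # CRP prior times marginal likelihood for each existing cluster ...
            logp = [np.log(sizes[k]) + log_marginal(X[d], counts[k])
                    for k in range(len(counts))]
            # ... and for a brand-new cluster
            logp.append(np.log(alpha) + log_marginal(X[d], np.zeros(vocab_size)))
            logp = np.array(logp)
            probs = np.exp(logp - logp.max())
            probs /= probs.sum()
            k_new = rng.choice(len(probs), p=probs)
            if k_new == len(counts):        # open a new cluster
                counts.append(X[d].copy())
                sizes.append(1)
            else:
                counts[k_new] += X[d]
                sizes[k_new] += 1
            z[d] = k_new
    return z, len(counts)
```

Because the sampler always reserves probability mass proportional to alpha for opening a new cluster, documents from unseen categories can spawn their own clusters, which is the mechanism that removes the need to pre-specify the number of new categories.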
