D-LDA: A Topic Modeling Approach without Constraint Generation for Semi-defined Classification

We study what we call semi-defined classification, which deals with categorization tasks in which the taxonomy of the data is not well defined in advance. It is motivated by real-world applications where the unlabeled data may come not only from the known classes of the labeled data but also from other unknown classes. Given the unlabeled data, our goal is not only to identify the instances belonging to the known classes, but also to cluster the remaining data into other meaningful groups. The problem differs from traditional semi-supervised clustering in that, in semi-supervised clustering, the supervision knowledge is far from representative of a target classification, while in semi-defined classification the labeled data may be sufficient to supervise learning on the known classes. In this paper we propose the Double-latent-layered LDA model (D-LDA for short) for this problem. Compared with LDA, which has only one latent variable y for word topics, D-LDA contains an additional latent variable z for (known and unknown) document classes. With the double latent layers consisting of y and z, and the dependency between them, D-LDA directly injects the class labels into z to supervise the learning of word topics in y. Thus, semi-supervised learning in D-LDA does not require the generation of pairwise constraints, which most previous semi-supervised clustering approaches need. We present experimental results on ten data sets for semi-defined classification. Our results are either comparable to (on one data set) or significantly better than (on the other nine data sets) those of the six compared methods, including state-of-the-art semi-supervised clustering methods.
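To make the double-latent-layer structure concrete, the sketch below mimics the generative process implied by the abstract: a document class z is drawn (or injected directly from a label when one is available), a word topic y is drawn conditioned on z, and each word is drawn from y. This is a minimal sketch under assumed Dirichlet priors; the variable names, dimensions, and priors are illustrative assumptions, not the authors' specification.

```python
import numpy as np

# Illustrative sketch of a two-latent-layer generative process in the
# spirit of D-LDA: class z -> word topic y -> word w. All hyperparameters
# and names here are assumptions for illustration only.

rng = np.random.default_rng(0)

K_classes = 5   # known + unknown document classes (latent variable z)
K_topics = 20   # word topics (latent variable y)
V = 1000        # vocabulary size

# Class proportions, per-class topic mixtures, per-topic word distributions.
pi = rng.dirichlet(np.ones(K_classes))                    # p(z)
theta = rng.dirichlet(np.ones(K_topics), size=K_classes)  # p(y | z)
phi = rng.dirichlet(np.ones(V), size=K_topics)            # p(w | y)

def generate_document(length, label=None):
    """Sample one document; a known class label is injected directly
    into z, which is how the abstract says supervision enters the model
    (no pairwise constraints needed). For unlabeled documents z is drawn
    from its prior and would stay latent during inference."""
    z = label if label is not None else rng.choice(K_classes, p=pi)
    words = []
    for _ in range(length):
        y = rng.choice(K_topics, p=theta[z])    # topic depends on class
        words.append(rng.choice(V, p=phi[y]))   # word depends on topic
    return z, words

# Labeled document: class observed. Unlabeled document: class latent.
z_obs, doc_labeled = generate_document(50, label=2)
z_lat, doc_unlabeled = generate_document(50)
```

In this sketch the supervision enters only through the assignment of z for labeled documents; inference over the unlabeled documents can then place them either in a known class or in one of the extra classes reserved for unknown groups.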
