A probabilistic approach towards an unbiased semi-supervised cluster tree

Abstract: Conventionally, a large amount of annotated data is required to train an accurate classifier. However, acquiring such a dataset is usually infeasible because of the high annotation cost. Semi-supervised learning has therefore emerged and attracted increasing research effort in recent years. Semi-supervised learning is inherently sensitive to how the unlabeled data is sampled: model performance may deteriorate seriously if biased unlabeled data is sampled at an early stage. In this paper, an unbiased semi-supervised cluster tree is proposed that is learned from only a few labeled data. Specifically, a K-means algorithm is adopted to build each level of this hierarchical tree in a top-down manner. The number of clusters is determined by the number of classes contained in the labeled data. The confidence error of the cluster tree is analyzed theoretically and then used to prune the tree. Empirical studies on several datasets demonstrate that the proposed semi-supervised cluster tree outperforms state-of-the-art semi-supervised learning algorithms in classification accuracy.
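
The top-down construction described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and several details are assumptions made only for illustration: the number of clusters at a node is taken to be the number of distinct classes among the labeled points that reach that node, a node stops splitting once its labeled points share a single class, leaves are labeled by majority vote, and the pruning step is omitted because the abstract does not give the confidence error formula. The names `kmeans`, `build_tree`, and `predict` are hypothetical.

```python
import random
from collections import Counter

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(x, centroids):
    """Index of the centroid closest to x."""
    return min(range(len(centroids)), key=lambda j: sq_dist(x, centroids[j]))

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns (centroids, assignments)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        assign = [nearest(p, centroids) for p in points]
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:  # keep the old centroid if a cluster empties
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    return centroids, [nearest(p, centroids) for p in points]

def build_tree(points, labels, depth=0, max_depth=3):
    """Recursively split a node; labels use None for unlabeled points."""
    known = [y for y in labels if y is not None]
    node = {"label": Counter(known).most_common(1)[0][0] if known else None,
            "centroids": None, "children": None}
    k = len(set(known))  # assumption: k = classes seen among labeled points here
    if k < 2 or depth >= max_depth or len(points) < k:
        return node  # pure, too deep, or too small: keep as a leaf
    centroids, assign = kmeans(points, k, seed=depth)
    groups = [[i for i, a in enumerate(assign) if a == j] for j in range(k)]
    if any(len(g) == len(points) for g in groups):
        return node  # degenerate split: nothing was separated
    node["centroids"], node["children"] = centroids, []
    for g in groups:
        child = build_tree([points[i] for i in g], [labels[i] for i in g],
                           depth + 1, max_depth)
        if child["label"] is None:  # no labeled point fell in this cluster
            child["label"] = node["label"]
        node["children"].append(child)
    return node

def predict(node, x):
    """Descend to a leaf by following the nearest centroid at each level."""
    while node["children"]:
        node = node["children"][nearest(x, node["centroids"])]
    return node["label"]

# Toy data: two well-separated groups, one labeled point in each.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 5.2), (5.2, 4.9)]
labels = ["a", None, None, "b", None, None]
tree = build_tree(points, labels)
print(predict(tree, (0.05, 0.1)), predict(tree, (5.0, 5.1)))
```

On this toy input the root is split into two clusters because two classes are labeled, and each child becomes a leaf because its labeled points share one class. A pruning pass based on the paper's confidence error would operate on the tree returned by `build_tree`.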
