When Does Cotraining Work in Real Data?

Cotraining, a semi-supervised learning paradigm, promises to effectively alleviate the shortage of labeled examples in supervised learning. The standard two-view cotraining requires the data set to be described by two views of features, and previous studies have shown that cotraining works well if the two views satisfy the sufficiency and independence assumptions. In practice, however, these two assumptions are often unknown and cannot be ensured (even when two views are given). More commonly, most supervised data sets are described by a single set of attributes (one view), which must be split into two views before the standard two-view cotraining can be applied. In this paper, we first propose a novel approach to empirically verify the two cotraining assumptions when two views are given. We then design several methods to split single-view data sets into two views so that cotraining works reliably well. Our empirical results show that, given the whole (or a large) labeled training set, our view-verification and view-splitting methods are quite effective. Unfortunately, cotraining is called for precisely when the labeled training set is small, and we show that with small labeled training sets the two cotraining assumptions are difficult to verify and view splitting is unreliable. Our conclusions on cotraining's effectiveness are therefore mixed. If two views are given and known to satisfy the two assumptions, cotraining works well. Otherwise, with only a small labeled training set, verifying the assumptions or splitting a single view into two views is unreliable, so it is uncertain whether standard cotraining will work.
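For readers unfamiliar with the procedure, the standard two-view cotraining loop referred to above can be summarized in a short sketch. The following Python code is a minimal illustration, not the paper's implementation: it assumes the two views are supplied as feature matrices X1/X2, uses naive Bayes as the base learner, and lets each view's classifier label its most confident unlabeled examples for the shared labeled pool; the growth size k and the number of rounds are illustrative choices.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB


def co_train(X1_l, X2_l, y_l, X1_u, X2_u, rounds=10, k=5):
    """Two-view cotraining sketch: in each round, every view's classifier
    labels its k most confident unlabeled examples, and those examples
    (with the predicted labels) are moved into the labeled set."""
    unlabeled = np.arange(X1_u.shape[0])      # indices of still-unlabeled examples
    h1, h2 = GaussianNB(), GaussianNB()       # one base learner per view
    for _ in range(rounds):
        h1.fit(X1_l, y_l)
        h2.fit(X2_l, y_l)
        for h, X_view in ((h1, X1_u), (h2, X2_u)):
            if unlabeled.size == 0:
                break
            # Confidence of this view's classifier on the remaining unlabeled pool
            conf = h.predict_proba(X_view[unlabeled]).max(axis=1)
            top = unlabeled[np.argsort(conf)[-k:]]   # its k most confident picks
            # Add the picked examples (in both views) to the labeled set
            X1_l = np.vstack([X1_l, X1_u[top]])
            X2_l = np.vstack([X2_l, X2_u[top]])
            y_l = np.concatenate([y_l, h.predict(X_view[top])])
            unlabeled = np.setdiff1d(unlabeled, top)
    return h1, h2
```

For a single-view data set, a splitting method would first partition the attribute columns into the two matrices (X1_l/X2_l and X1_u/X2_u) before running the same loop; a random partition of the columns is one simple possibility, shown here only to illustrate where splitting fits in the pipeline, not as the paper's method.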
