Co-Training and Expansion: Towards Bridging Theory and Practice

Co-training is a method for combining labeled and unlabeled data when each example can be thought of as containing two distinct sets of features. It has had a number of practical successes, yet previous theoretical analyses have required very strong assumptions on the data that are unlikely to be satisfied in practice. In this paper, we propose a much weaker "expansion" assumption on the underlying data distribution, which we prove is sufficient for iterative co-training to succeed given appropriately strong PAC-learning algorithms on each feature set, and which is to some extent necessary as well. This expansion assumption in fact motivates the iterative nature of the original co-training algorithm, unlike stronger assumptions (such as independence of the two views given the label) that allow a simpler one-shot co-training to succeed. We also heuristically analyze the effect of noise in the data on performance. The predicted behavior is qualitatively matched in synthetic experiments on expander graphs.
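For concreteness, the expansion condition has roughly the following form (a sketch in our own notation, not the paper's verbatim statement): write X_1 and X_2 for the two feature sets, D^+ for the distribution over positive examples, and let S_1 ⊆ X_1 and S_2 ⊆ X_2 be the events that each view's current hypothesis is confident. Then D^+ is ε-expanding if for all such S_1, S_2:

```latex
% Sketch of the expansion condition (our paraphrase of the paper's
% "left-right expansion"; constants and quantifiers are illustrative):
\Pr\bigl(S_1 \oplus S_2\bigr) \;\ge\; \epsilon \cdot
    \min\!\Bigl[\Pr\bigl(S_1 \wedge S_2\bigr),\;
                \Pr\bigl(\overline{S_1} \wedge \overline{S_2}\bigr)\Bigr]
```

Intuitively, the mass of examples on which exactly one view is confident, which is precisely what one round of bootstrapping can newly label, is at least a constant fraction of the mass on which learning could still make progress.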
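The iterative algorithm itself is simple to sketch. Below is a minimal Python rendering of the bootstrapping loop, assuming numpy arrays and scikit-learn-style estimators with predict_proba; the function name co_train and the parameters n_rounds and per_round are illustrative, and confidence-based selection stands in for the original algorithm's fixed quotas of positive and negative picks.

```python
# Minimal sketch of iterative co-training; names and selection rule are
# illustrative assumptions, not the paper's exact algorithm.
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression

def co_train(X1_lab, X2_lab, y_lab, X1_unlab, X2_unlab,
             n_rounds=10, per_round=5,
             base=LogisticRegression(max_iter=1000)):
    """Each round: train one classifier per view on the current labeled
    pool, let each classifier label the unlabeled examples it is most
    confident about, and add those examples to the shared labeled pool."""
    X1, X2, y = X1_lab.copy(), X2_lab.copy(), y_lab.copy()
    U = np.arange(len(X1_unlab))              # indices still unlabeled
    h1, h2 = clone(base), clone(base)
    for _ in range(n_rounds):
        if len(U) == 0:
            break
        h1.fit(X1, y)
        h2.fit(X2, y)
        picked, labels = [], []
        for h, Xu in ((h1, X1_unlab), (h2, X2_unlab)):
            proba = h.predict_proba(Xu[U])
            top = np.argsort(-proba.max(axis=1))[:per_round]
            picked.extend(U[top])
            labels.extend(h.classes_[proba[top].argmax(axis=1)])
        picked = np.array(picked)             # duplicate picks across the
        labels = np.array(labels)             # two views are tolerated here
        X1 = np.vstack([X1, X1_unlab[picked]])
        X2 = np.vstack([X2, X2_unlab[picked]])
        y = np.concatenate([y, labels])
        U = np.setdiff1d(U, picked)
    return h1, h2
```

Each round grows the labeled pool with the examples each view is most confident about; this is exactly the step whose per-round progress the expansion condition above is meant to guarantee.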
