Metric-Based Approaches for Semi-Supervised Regression and Classification

Semi-supervised learning methods typically require an explicit relationship to be asserted between the labeled and unlabeled data, as illustrated, for example, by the neighbourhoods used in graph-based methods. The semi-supervised model selection and regularization methods presented here instead require only that the labeled and unlabeled data be drawn from the same distribution. From this assumption, a metric can be constructed over hypotheses based on their predictions for unlabeled data. This metric can then be used to detect untrustworthy training error estimates, leading to model selection strategies that select the richest hypothesis class while providing theoretical guarantees against over-fitting. This general approach is then adapted to regularization for supervised regression and supervised classification with probabilistic classifiers. The regularization adapts not only to the hypothesis class but also to the specific data sample provided, allowing for better performance than regularizers that account only for class complexity.
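
To make the idea concrete, the following is a minimal sketch of how a distance between hypotheses might be estimated from unlabeled data and used to flag an untrustworthy training error. It is an illustration under stated assumptions, not the authors' implementation: the root-mean-square distance, the triangle-inequality consistency check, and the names `rms_distance` and `select_by_triangle_check` are all assumptions chosen for this example. Hypotheses are assumed to be callables ordered from simplest to richest.

```python
import numpy as np


def rms_distance(pred_a, pred_b):
    """Root-mean-square disagreement between two prediction vectors."""
    pred_a = np.asarray(pred_a, dtype=float)
    pred_b = np.asarray(pred_b, dtype=float)
    return float(np.sqrt(np.mean((pred_a - pred_b) ** 2)))


def select_by_triangle_check(hypotheses, x_train, y_train, x_unlabeled):
    """Return the index of the last hypothesis consistent with all simpler ones.

    Assumption for this sketch: hypothesis k is accepted only if, for every
    simpler hypothesis j, the distance between j and k measured on unlabeled
    inputs does not exceed the sum of their training errors. A violation
    suggests that at least one of the two training errors is too optimistic.
    """
    # Training error of each hypothesis, measured with the same distance.
    train_err = [rms_distance(h(x_train), y_train) for h in hypotheses]
    # Predictions on unlabeled inputs; no labels are needed here.
    unl_pred = [np.asarray(h(x_unlabeled), dtype=float) for h in hypotheses]

    chosen = 0
    for k in range(1, len(hypotheses)):
        consistent = all(
            rms_distance(unl_pred[j], unl_pred[k]) <= train_err[j] + train_err[k]
            for j in range(k)
        )
        if consistent:
            chosen = k
    return chosen
```

In this sketch, a hypothesis that fits the training labels closely but behaves very differently from simpler hypotheses on the unlabeled inputs fails the check, so its low training error is not trusted.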
