An Information Theoretic Framework for Multi-view Learning

In the multi-view learning paradigm, the input variable is partitioned into two different views X1 and X2, and there is a target variable Y of interest. The underlying assumption is that either view alone is sufficient to predict the target Y accurately. This provides a natural semi-supervised learning setting in which unlabeled data can be used to eliminate, from either view, hypotheses whose predictions tend to disagree with predictions based on the other view. This work explicitly formalizes an information theoretic multi-view assumption and studies the multi-view paradigm in the PAC-style semi-supervised framework of Balcan and Blum [2006]. That framework assumes an incompatibility function is known; roughly speaking, the incompatibility function scores a hypothesis using the unlabeled data alone. Here, we show how to derive incompatibility functions for certain loss functions of interest, so that minimizing this incompatibility over unlabeled data helps reduce the expected loss on future test cases. In particular, we show how the class of empirically successful co-regularization algorithms falls into our framework, and we provide performance bounds (using the results of Rosenberg and Bartlett [2007] and Farquhar et al. [2005]). We also provide a normative justification for canonical correlation analysis (CCA) as a dimensionality reduction technique. In particular, we show (for strictly convex loss functions of the form ℓ(w·x, y)) that CCA can first be used as a dimensionality reduction technique and, provided the multi-view assumption is satisfied, this projection does not throw away much predictive information about the target Y; the benefit is that subsequent learning with a labeled set need only work in this lower-dimensional space.
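The co-regularization idea referenced above can be made concrete with a small numerical sketch. The snippet below is not the paper's algorithm; it is a minimal illustration assuming a least-squares loss, synthetic two-view data, and illustrative penalty weights `lam` and `mu`. It fits linear predictors w1 and w2 for the two views by jointly minimizing the labeled squared error, a ridge penalty, and a disagreement penalty on unlabeled points, where the disagreement term plays the role of the incompatibility function.

```python
# Hedged sketch: co-regularized least squares for two views.
# A minimal illustration (not the paper's exact algorithm) of penalizing
# cross-view disagreement on unlabeled data as an incompatibility score.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic two-view data: both views are noisy linear encodings of a shared
# latent signal z, and the target y depends on z alone (so either view suffices).
n_lab, n_unlab, d1, d2 = 30, 300, 10, 12
z = rng.normal(size=(n_lab + n_unlab, 3))
A1, A2 = rng.normal(size=(3, d1)), rng.normal(size=(3, d2))
X1 = z @ A1 + 0.1 * rng.normal(size=(n_lab + n_unlab, d1))
X2 = z @ A2 + 0.1 * rng.normal(size=(n_lab + n_unlab, d2))
w_true = rng.normal(size=3)
y = z @ w_true + 0.05 * rng.normal(size=n_lab + n_unlab)

# Labeled / unlabeled split (labels on the unlabeled part are never used).
X1_l, X2_l, y_l = X1[:n_lab], X2[:n_lab], y[:n_lab]
U1, U2 = X1[n_lab:], X2[n_lab:]

lam, mu = 1.0, 0.1  # ridge penalty and co-regularization (agreement) penalty

# Setting to zero the gradient of
#   sum_lab [(w1.x1 - y)^2 + (w2.x2 - y)^2] + lam (|w1|^2 + |w2|^2)
#   + mu * sum_unlab (w1.u1 - w2.u2)^2
# gives one block linear system in the stacked vector (w1, w2).
A11 = X1_l.T @ X1_l + lam * np.eye(d1) + mu * U1.T @ U1
A22 = X2_l.T @ X2_l + lam * np.eye(d2) + mu * U2.T @ U2
A12 = -mu * U1.T @ U2
M = np.block([[A11, A12], [A12.T, A22]])
b = np.concatenate([X1_l.T @ y_l, X2_l.T @ y_l])
w = np.linalg.solve(M, b)
w1, w2 = w[:d1], w[d1:]

# Predict by averaging the two views' predictions on held-out points.
y_hat = 0.5 * (U1 @ w1 + U2 @ w2)
print("held-out MSE:", np.mean((y_hat - y[n_lab:]) ** 2))
```

Because the disagreement term couples the two views, the stationarity conditions form a single block linear system rather than two independent ridge problems; averaging the two views' predictions at test time is one common (though not the only) choice.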

[1] H. Hotelling. The most predictable criterion. Journal of Educational Psychology, 1935.

[2] T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley, 2005.

[3] A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. COLT, 1998.

[4] N. Tishby, F. C. Pereira, and W. Bialek. The information bottleneck method. arXiv, 2000.

[5] S. Dasgupta, M. L. Littman, and D. McAllester. PAC generalization bounds for co-training. NIPS, 2001.

[6] T. Zhang. Covering number bounds of certain regularized linear function classes. Journal of Machine Learning Research, 2002.

[7] G. Chechik, A. Globerson, N. Tishby, and Y. Weiss. Information bottleneck for Gaussian variables. Journal of Machine Learning Research, 2003.

[8] A. Tsybakov. Optimal aggregation of classifiers in statistical learning. 2003.

[9] S. P. Abney. Understanding the Yarowsky algorithm. Computational Linguistics, 2004.

[10] J. D. R. Farquhar, D. R. Hardoon, H. Meng, J. Shawe-Taylor, and S. Szedmák. Two view learning: SVM-2K, theory and practice. NIPS, 2005.

[11] V. Sindhwani, P. Niyogi, and M. Belkin. A co-regularization approach to semi-supervised learning with multiple views. ICML Workshop on Learning with Multiple Views, 2005.

[12] M.-F. Balcan and A. Blum. A PAC-style model for learning from labeled and unlabeled data. COLT, 2005.

[13] P. L. Bartlett, M. I. Jordan, and J. D. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 2006.

[14] U. Brefeld, T. Gärtner, T. Scheffer, and S. Wrobel. Efficient co-regularised least squares regression. ICML, 2006.

[15] S. M. Kakade and D. P. Foster. Multi-view regression via canonical correlation analysis. COLT, 2007.

[16] M.-F. Balcan and A. Blum. Open problems in efficient semi-supervised PAC learning. COLT, 2007.

[17] I. Steinwart and C. Scovel. Fast rates for support vector machines using Gaussian kernels. The Annals of Statistics, 2007.

[18] D. S. Rosenberg and P. L. Bartlett. The Rademacher complexity of co-regularized kernel classes. AISTATS, 2007.