Dimensionality Reduction for Supervised Learning with Reproducing Kernel Hilbert Spaces

We propose a novel method of dimensionality reduction for supervised learning problems. Given a regression or classification problem in which we wish to predict a response variable Y from an explanatory variable X, we treat the problem of dimensionality reduction as that of finding a low-dimensional "effective subspace" for X which retains the statistical relationship between X and Y. We show that this problem can be formulated in terms of conditional independence. To turn this formulation into an optimization problem we establish a general nonparametric characterization of conditional independence using covariance operators on reproducing kernel Hilbert spaces. This characterization allows us to derive a contrast function for estimation of the effective subspace. Unlike many conventional methods for dimensionality reduction in supervised learning, the proposed method requires neither assumptions on the marginal distribution of X nor a parametric model of the conditional distribution of Y. We present experiments that compare the performance of the proposed method with that of conventional approaches.
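To make the contrast-function idea concrete, the sketch below evaluates a regularized empirical conditional-covariance criterion of the form Tr[G_Y (G_{B^T X} + n eps I)^{-1}], built from centered Gaussian Gram matrices of the response Y and of the projected covariate B^T X. The kernel choice, the bandwidths sigma_x and sigma_y, the regularizer eps, and the helper names (centered_gram, kdr_contrast) are illustrative assumptions rather than the paper's exact estimator or optimization procedure; minimizing such a criterion over projection matrices B is the general strategy described above.

```python
# Minimal sketch of a kernel-based contrast for an "effective subspace".
# Assumptions (not from the paper's implementation): Gaussian kernels,
# fixed bandwidths, a simple ridge-style regularizer eps, and evaluation
# of the contrast only (no optimizer over B is included here).
import numpy as np

def centered_gram(Z, sigma):
    """Centered Gaussian Gram matrix H K H with K_ij = exp(-||z_i - z_j||^2 / (2 sigma^2))."""
    sq = np.sum(Z**2, axis=1)
    K = np.exp(-(sq[:, None] + sq[None, :] - 2 * Z @ Z.T) / (2 * sigma**2))
    n = Z.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    return H @ K @ H

def kdr_contrast(B, X, Y, sigma_x=1.0, sigma_y=1.0, eps=1e-3):
    """Empirical contrast Tr[ G_Y (G_U + n*eps*I)^{-1} ] with U = X B.

    Smaller values indicate that the projection U = B^T X retains more of
    the statistical relationship between X and Y; at the minimizer, Y is
    (approximately) conditionally independent of X given B^T X.
    """
    n = X.shape[0]
    G_u = centered_gram(X @ B, sigma_x)            # Gram matrix of projected X
    G_y = centered_gram(Y.reshape(n, -1), sigma_y)  # Gram matrix of the response
    return np.trace(G_y @ np.linalg.inv(G_u + n * eps * np.eye(n)))

# Toy usage: compare a projection onto the relevant coordinate with a random one.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
Y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=100)   # Y depends only on the first coordinate
B_good = np.array([[1.0], [0.0], [0.0]])
B_rand, _ = np.linalg.qr(rng.normal(size=(3, 1)))
print(kdr_contrast(B_good, X, Y), kdr_contrast(B_rand, X, Y))
```

In this toy setup the projection onto the first coordinate would be expected to yield a smaller contrast value than a random direction, illustrating how the criterion can be used to rank candidate subspaces before handing it to a gradient-based or manifold optimizer over B.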
