Active Learning and Basis Selection for Kernel-Based Linear Models: A Bayesian Perspective

We develop an active learning algorithm for kernel-based linear regression and classification. The proposed greedy algorithm employs a minimum-entropy criterion derived using a Bayesian interpretation of ridge regression. We assume access to a matrix, $\boldsymbol{\Psi} \in \mathbb{R}^{N \times N}$, for which the $(i,j)$th element is defined by the kernel function $K(\boldsymbol{\psi}_i, \boldsymbol{\psi}_j) \in \mathbb{R}$, with the observed data $\boldsymbol{\psi}_i \in \mathbb{R}^d$. We seek a model, $\mathcal{M}: \boldsymbol{\psi}_i \rightarrow y_i$, where $y_i$ is a real-valued response or integer-valued label to which we do not have access a priori. To achieve this goal, a submatrix, $\boldsymbol{\Psi}_{I_l, I_b} \in \mathbb{R}^{n \times m}$, is sought that corresponds to the intersection of $n$ rows and $m$ columns of $\boldsymbol{\Psi}$, indexed by the sets $I_l$ and $I_b$, respectively. Typically $m \ll N$ and $n \ll N$. We have two objectives: (i) determine the $m$ columns of $\boldsymbol{\Psi}$, indexed by the set $I_b$, that are most informative for building a linear model, $\mathcal{M}: [1\ \boldsymbol{\psi}_{i,I_b}]^T \rightarrow y_i$, without any knowledge of $\{y_i\}_{i=1}^N$, and (ii) using active learning, sequentially determine which subset of $n$ elements of $\{y_i\}_{i=1}^N$ should be acquired; both stopping values, $|I_b| = m$ and $|I_l| = n$, are also to be inferred from the data. These steps are taken with the goal of minimizing the uncertainty about the model parameters, $\mathbf{x}$, as measured by the differential entropy of their posterior distribution. The parameter vector $\mathbf{x} \in \mathbb{R}^m$, as well as the model bias $\beta \in \mathbb{R}$, is then learned from the resulting problem, $\mathbf{y}_{I_l} = \boldsymbol{\Psi}_{I_l, I_b}\,\mathbf{x} + \beta\mathbf{1} + \boldsymbol{\epsilon}$. The remaining $N - n$ responses/labels not included in $\mathbf{y}_{I_l}$ can be inferred by applying $\mathbf{x}$ to the remaining $N - n$ rows of $\boldsymbol{\Psi}_{:, I_b}$. We show experimental results for several regression and classification problems, and compare against other active learning methods.
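The entropy-driven selection step is small enough to sketch. Below is a minimal, self-contained Python illustration (not the paper's exact algorithm): under a Gaussian (ridge) prior on $\mathbf{x}$ and Gaussian noise, the posterior covariance of $\mathbf{x}$ does not depend on the labels, so rows of $\boldsymbol{\Psi}_{:,I_b}$ can be chosen greedily to minimize the posterior differential entropy before any $y_i$ is acquired. The kernel width `gamma`, prior variance `alpha`, noise variance `sigma2`, and the fixed basis index set `basis_idx` are illustrative assumptions, and the bias is simply folded into the weight vector here.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    # K(psi_i, psi_j) = exp(-gamma * ||psi_i - psi_j||^2)
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * d2)

def posterior_entropy(Phi_rows, alpha=1.0, sigma2=0.1):
    # Bayesian ridge regression: prior x ~ N(0, alpha*I), noise ~ N(0, sigma2*I).
    # Posterior covariance: Sigma = (Phi^T Phi / sigma2 + I / alpha)^(-1);
    # differential entropy: H = 0.5 * log det(2*pi*e * Sigma).
    m = Phi_rows.shape[1]
    precision = Phi_rows.T @ Phi_rows / sigma2 + np.eye(m) / alpha
    _, logdet_precision = np.linalg.slogdet(precision)
    return 0.5 * (m * np.log(2.0 * np.pi * np.e) - logdet_precision)

def greedy_active_selection(Phi, n_select):
    # Greedily pick the rows (samples whose labels to acquire) that most
    # reduce the posterior entropy of x; no labels y are needed for this.
    N = Phi.shape[0]
    selected, remaining = [], list(range(N))
    for _ in range(n_select):
        best_i, best_h = None, np.inf
        for i in remaining:
            h = posterior_entropy(Phi[selected + [i], :])
            if h < best_h:
                best_i, best_h = i, h
        selected.append(best_i)
        remaining.remove(best_i)
    return selected

# Toy usage: 50 points in R^3, a fixed 10-column basis I_b, select n = 5 rows.
rng = np.random.default_rng(0)
data = rng.normal(size=(50, 3))
K = rbf_kernel(data, data)
basis_idx = list(range(10))                           # I_b (assumed fixed here)
Phi = np.hstack([np.ones((50, 1)), K[:, basis_idx]])  # rows are [1, psi_{i, I_b}]
print("rows to label (I_l):", greedy_active_selection(Phi, 5))
```

Because the Gaussian posterior covariance is label-independent, minimizing entropy reduces to maximizing the log-determinant of the posterior precision, which is why the greedy search above never touches $y$; the paper's criterion additionally drives the choice of $I_b$ and the stopping sizes $m$ and $n$, which this sketch takes as given.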
