Discriminative models and dimensionality reduction for regression

Many prediction problems that arise in computer vision and robotics can be formulated within a regression framework. Unlike traditional regression problems, vision and robotics tasks are often characterized by a varying number of output variables with complex dependency structures, and they are further aggravated by the high dimensionality of the input. In this thesis I address two challenging tasks related to learning regressors in such settings: (1) developing discriminative approaches that can handle structured output variables, and (2) reducing the dimensionality of the input while preserving its statistical relationship with the output.

A complex dependency structure in the output variables can be effectively captured by probabilistic graphical models. In contrast to traditional joint modeling of the data, I propose conditional models and a discriminative learning approach that are directly tied to the ultimate prediction objective. While discriminative learning of structured models such as Conditional Random Fields (CRFs) has attracted significant interest in the past, learning structured models in the regression setting has rarely been explored. In this work I first extend CRFs and discriminatively trained HMMs to the structured output regression problem, proposing two approaches: one based on directed models and one based on undirected models. In the undirected approach, parameter learning is cast as a convex optimization problem, together with a new method that effectively handles the density integrability constraint. Experiments in several problem domains, including human motion and robot-arm state estimation, show that the new models yield prediction accuracy comparable to or better than that of state-of-the-art approaches.

In the second part of the thesis, I consider the task of finding a low-dimensional representation of the input covariates that preserves their statistical relationship with the output. This task, known as dimensionality reduction for regression (DRR), is particularly useful for visualizing high-dimensional data, designing efficient regressors with a reduced input dimension, and removing noise from the input by uncovering the information essential for predicting the output. While dimensionality reduction is common in many machine learning tasks, its use in regression settings has not been widespread. A number of recent DRR methods have been proposed in the statistics community, but they suffer from several limitations, including non-convex formulations and the need to slice a potentially high-dimensional output space. I address these issues with a novel approach based on covariance operators in reproducing kernel Hilbert spaces (RKHSs) that provides a closed-form DRR solution without explicit slicing. The benefits of this approach are demonstrated in a comprehensive set of evaluations on several important regression problems in computer vision and pattern recognition.
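
To make the first part more concrete, the sketch below shows discriminative learning of a simple Gaussian conditional model p(y | x), the simplest instance of the conditional-regression setting described above. It is an illustrative sketch rather than the thesis implementation: the function names, the toy data, and the projected-gradient treatment of the integrability constraint (keeping the precision matrix positive definite) are assumptions made for this example; the objective itself is the standard negative conditional log-likelihood, which is jointly convex in the natural parameters.

```python
# A minimal sketch (assumed names and toy data, not the thesis implementation) of
# discriminatively learning a Gaussian conditional model
#     p(y | x) ~ exp(-1/2 y' L y + y' T x),   i.e.  y | x ~ N(L^{-1} T x, L^{-1}).
# The average negative conditional log-likelihood is jointly convex in (L, T);
# integrability of the density requires the precision L to stay positive definite,
# enforced here by a simple eigenvalue-floor projection after each gradient step.
import numpy as np

def neg_cond_loglik(L, T, X, Y):
    """Average negative conditional log-likelihood (up to an additive constant)."""
    n = X.shape[0]
    Linv = np.linalg.inv(L)
    M = X @ T.T                                            # rows: T x_i
    val = 0.5 * np.einsum('ij,jk,ik->', Y, L, Y) / n       # mean of 1/2 y' L y
    val -= np.einsum('ij,ij->', Y, M) / n                  # mean of  -y' T x
    val += 0.5 * np.einsum('ij,jk,ik->', M, Linv, M) / n   # mean of 1/2 (Tx)' L^{-1} (Tx)
    val -= 0.5 * np.linalg.slogdet(L)[1]                   # -1/2 log det L
    return val

def fit_gaussian_conditional(X, Y, steps=500, lr=0.05, eps=1e-3):
    """Projected gradient descent on the convex objective; returns (L, T)."""
    n, p = X.shape
    d = Y.shape[1]
    L = np.eye(d)                        # precision of y | x
    T = np.zeros((d, p))                 # linear interaction term
    for _ in range(steps):
        Linv = np.linalg.inv(L)
        mu = (X @ T.T) @ Linv            # conditional means L^{-1} T x_i
        gT = (mu - Y).T @ X / n
        gL = 0.5 * (Y.T @ Y / n - mu.T @ mu / n - Linv)
        T -= lr * gT
        L -= lr * gL
        # project onto L >= eps * I so the conditional density stays integrable
        w, V = np.linalg.eigh(0.5 * (L + L.T))
        L = (V * np.maximum(w, eps)) @ V.T
    return L, T

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 5))
    W = 0.5 * rng.normal(size=(3, 5))                # true regression map
    Y = X @ W.T + rng.normal(size=(300, 3))          # unit-variance output noise
    L, T = fit_gaussian_conditional(X, Y)
    pred = X @ (np.linalg.inv(L) @ T).T              # conditional mean predictions
    print("final objective:", float(neg_cond_loglik(L, T, X, Y)))
    print("prediction RMSE:", float(np.sqrt(np.mean((pred - Y) ** 2))))
```

The learned conditional mean (`np.linalg.inv(L) @ T @ x` in the code) serves as the prediction; the thesis develops these ideas for structured, sequential outputs, whereas this toy example treats each output independently.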

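For the second part, the sketch below illustrates the flavor of a closed-form, kernel-based DRR estimator: the inverse regression curve E[x | y] is estimated by kernel smoothing over the outputs, which removes the need to slice the output space, and the dimension-reduction subspace is obtained from a single generalized eigenproblem. This is a simplified stand-in and not the exact covariance-operator estimator derived in the thesis; the function names, kernel choice, regularization, and toy data are assumptions made for illustration.

```python
# A minimal sketch (assumed names, kernel, and toy data; not the exact
# covariance-operator estimator of the thesis) of closed-form kernel DRR:
# E[x | y] is estimated by kernel smoothing over the outputs (no slicing of
# the output space), and the subspace follows from one generalized eigenproblem.
import numpy as np
from scipy.linalg import eigh

def rbf_gram(Z, sigma):
    """Gaussian RBF Gram matrix of the rows of Z."""
    sq = np.sum(Z ** 2, axis=1)
    D = sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T
    return np.exp(-np.maximum(D, 0.0) / (2.0 * sigma ** 2))

def kernel_inverse_regression_drr(X, Y, d, sigma=1.0, lam=1e-3):
    """Return a p x d basis estimate of the dimension-reduction subspace."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    Ky = rbf_gram(Y, sigma)
    # kernel-ridge smoother on the outputs: row i of S @ Xc estimates E[x | y_i]
    S = Ky @ np.linalg.inv(Ky + n * lam * np.eye(n))
    M = S @ Xc
    cov_inv_reg = M.T @ M / n                      # covariance of the inverse regression curve
    cov_x = Xc.T @ Xc / n + 1e-8 * np.eye(p)       # input covariance (regularized)
    # generalized eigenproblem  cov_inv_reg v = w cov_x v ; keep the top-d directions
    w, V = eigh(cov_inv_reg, cov_x)
    return V[:, ::-1][:, :d]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(400, 10))
    # the 2-D output depends on X only through its first two coordinates
    Y = np.stack([X[:, 0] + np.sin(X[:, 1]), X[:, 0] - X[:, 1]], axis=1)
    Y += 0.05 * rng.normal(size=Y.shape)
    B = kernel_inverse_regression_drr(X, Y, d=2)
    print(np.round(B, 2))    # both columns should load mainly on coordinates 0 and 1
```

In this toy example the two-dimensional output depends on the input only through its first two coordinates, so the recovered basis is expected to concentrate on those coordinates.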