Random projections as regularizers: learning a linear discriminant from fewer observations than dimensions

We prove theoretical guarantees for an averaging ensemble of randomly projected Fisher linear discriminant (FLD) classifiers, focusing on the case where there are fewer training observations than data dimensions. The specific form and simplicity of this ensemble permit a direct and much more detailed analysis than the generic tools used in previous work. In particular, we derive the exact form of the generalization error of our ensemble, conditional on the training set, and from this we give theoretical guarantees that directly link the performance of the ensemble to that of the corresponding linear discriminant learned in the full data space. To the best of our knowledge, these are the first theoretical results to prove such an explicit link for any classifier and classifier-ensemble pair. Furthermore, we show that the randomly projected ensemble is equivalent to applying a sophisticated regularization scheme to the linear discriminant learned in the original data space, and that this prevents overfitting in small-sample conditions where the pseudo-inverse FLD learned in the data space is provably poor. Because our ensemble is learned from a set of randomly projected representations of the original high-dimensional data, the data can be collected, stored and processed entirely in this compressed form. We confirm our theoretical findings with experiments, and demonstrate the utility of our approach on several datasets from the bioinformatics domain and on one very high-dimensional dataset from the drug-discovery domain, both settings in which fewer observations than dimensions are the norm.
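To make the construction concrete, the following is a minimal Python sketch of such an averaging ensemble of randomly projected FLDs, assuming binary labels in {0, 1}, equal class priors, i.i.d. Gaussian projection matrices, and a pseudo-inverted pooled covariance in each projected space. The function names and parameter choices are illustrative and are not taken from the paper.

```python
import numpy as np

def rp_fld_ensemble_train(X, y, k, M, seed=None):
    """Train an averaging ensemble of M Fisher linear discriminants, each
    learned in a k-dimensional Gaussian random projection of the data.
    Illustrative sketch only: labels are assumed to be 0/1 with equal priors.
    """
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    X0, X1 = X[y == 0], X[y == 1]
    members = []
    for _ in range(M):
        R = rng.standard_normal((k, d)) / np.sqrt(k)   # random projection matrix
        Z0, Z1 = X0 @ R.T, X1 @ R.T                    # compressed training data
        m0, m1 = Z0.mean(axis=0), Z1.mean(axis=0)
        # pooled within-class covariance estimated in the projected space
        S = (np.cov(Z0, rowvar=False) * (len(Z0) - 1)
             + np.cov(Z1, rowvar=False) * (len(Z1) - 1)) / (len(X) - 2)
        # pseudo-inverse coincides with the inverse when S is full rank
        w = np.linalg.pinv(S) @ (m1 - m0)              # FLD direction in k dims
        b = -0.5 * w @ (m0 + m1)                       # threshold at the midpoint
        members.append((R, w, b))
    return members

def rp_fld_ensemble_predict(members, X):
    """Average the members' discriminant scores and threshold at zero."""
    scores = np.mean([(X @ R.T) @ w + b for R, w, b in members], axis=0)
    return (scores > 0).astype(int)
```

In this sketch each ensemble member sees only its own k-dimensional projection of the data, so training and prediction never require the original d-dimensional representation, in line with the compressed collection and storage described above.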
