Out-of-Sample Error Estimation: The Blessing of High Dimensionality

Learning from high-dimensional data is a difficult task since, for example, similarity and correlation in the data are poorly captured by conventional notions of distance. These issues are amplified in small-sample problems, i.e., when the cardinality of the dataset is much smaller than its dimensionality: in such cases, a reliable estimate of the accuracy of the trained model on new data is hard to obtain, because standard statistical inference approaches are inefficient in this regime. In this paper, we show that, at least under some assumptions, the high dimensionality of the data helps improve the assessment of the performance of a model trained on empirical data in supervised classification tasks. In particular, we propose to create copies of the original dataset in which only subsets of independent and informative features are considered in turn: we show that training a collection of classifiers on these copies and combining them helps close the gap between the true and the estimated error of the models. To assess the potential of the proposed approach and gain further insight into it, we test the method both on an artificial problem and on a series of real-world high-dimensional Human Gene Expression datasets.
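
The following is a minimal sketch of the idea described above, assuming a random-subspace-style construction: the features are partitioned into disjoint subsets, one classifier is trained per subset, and their predictions are combined by majority vote; a Clopper-Pearson bound then turns the held-out error estimate into a confidence statement about the true error. All specific choices here (logistic regression as the base learner, 20 subsets, the synthetic data) are illustrative assumptions, not the exact procedure of the paper.

```python
# Illustrative sketch only: a random-subspace-style ensemble in the spirit
# of the abstract. Base learner, number of subsets, and data are assumptions.
import numpy as np
from scipy.stats import beta
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic small-sample, high-dimensional problem: n << d.
n, d = 100, 2000
X = rng.standard_normal((n, d))
w = np.zeros(d)
w[:50] = 1.0                                  # only 50 informative features
y = (X @ w + 0.5 * rng.standard_normal(n) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

# Partition the feature indices into disjoint subsets ("copies" of the
# dataset restricted to different features) and fit one classifier on each.
n_subsets = 20
subsets = np.array_split(rng.permutation(d), n_subsets)
models = [LogisticRegression(max_iter=1000).fit(X_tr[:, s], y_tr)
          for s in subsets]

# Combine the ensemble by majority vote on the held-out data.
votes = np.stack([m.predict(X_te[:, s]) for m, s in zip(models, subsets)])
y_hat = (votes.mean(axis=0) > 0.5).astype(int)

k, m = int(np.sum(y_hat != y_te)), len(y_te)
print(f"held-out error estimate: {k / m:.3f}")

# One-sided 95% Clopper-Pearson upper bound on the true error rate,
# treating the m held-out predictions as m Bernoulli trials.
upper = 1.0 if k == m else beta.ppf(0.95, k + 1, m - k)
print(f"95% upper confidence bound on the true error: {upper:.3f}")
```

Disjoint subsets are used in this sketch so that each classifier sees different features; under the abstract's premise that many features are independent and informative, each copy can still carry enough signal to learn from.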
