Towards Robust Performance Guarantees for Models Learned from High-Dimensional Data

Models learned from high-dimensional spaces, where the number of features can exceed the number of observations, are susceptible to overfitting, since the subspaces selected as relevant for the learning task may appear relevant purely by chance. In these spaces, the performance of models is commonly highly variable and dependent on the target error estimators, data regularities and model properties. Highly variable performance is a common problem in the analysis of omics data, healthcare data, collaborative filtering data, and datasets composed of features extracted from unstructured data or mapped from multi-dimensional databases. In these contexts, assessing the statistical significance of the performance guarantees of models learned from high-dimensional spaces is critical to validate and weigh the increasingly available scientific statements derived from the behavior of these models. This chapter therefore surveys the challenges and opportunities of evaluating models learned in big data settings from the less-studied angle of big dimensionality. In particular, we propose a methodology to bound and compare the performance of multiple models. First, a set of prominent challenges is synthesized. Second, a set of principles is proposed to answer the identified challenges. These principles provide a roadmap with decisions to: i) select adequate statistical tests, loss functions and sampling schemes, ii) infer performance guarantees from multiple settings, including varying data regularities and learning parameterizations, and iii) guarantee their applicability to different types of models, including classification and descriptive models. To our knowledge, this work is the first attempt to provide a robust and flexible assessment of distinct types of models that is sensitive to both the dimensionality and the size of the data.
Empirical evidence supports the relevance of these principles: they offer a coherent setting to bound and compare the performance of models learned in high-dimensional spaces, and to study and refine the behavior of these models.
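
The claim that subspaces of interest can be selected by chance when features outnumber observations can be illustrated with a small simulation. The sketch below is not the methodology proposed in the chapter; it is a minimal illustration under assumed settings (20 observations, 2000 binary noise features, a fixed seed, and the hypothetical function name `chance_selection_demo`). Every feature is generated independently of the labels, so any apparent predictive power is spurious.

```python
import random


def chance_selection_demo(n_obs=20, n_features=2000, n_fresh=1000, seed=0):
    """Select the best of many pure-noise features and check it on fresh data.

    All sizes and the seed are illustrative assumptions, not values
    taken from the chapter.
    """
    rng = random.Random(seed)
    labels = [rng.randint(0, 1) for _ in range(n_obs)]

    # Score every noise feature on the same small sample and keep the best score.
    best_train_acc = 0.0
    for _ in range(n_features):
        feature = [rng.randint(0, 1) for _ in range(n_obs)]
        acc = sum(f == y for f, y in zip(feature, labels)) / n_obs
        best_train_acc = max(best_train_acc, acc)

    # On fresh observations the selected feature is still independent of the
    # label, so its agreement with the label is a fair coin flip per observation.
    fresh_hits = sum(
        rng.randint(0, 1) == rng.randint(0, 1) for _ in range(n_fresh)
    )
    return best_train_acc, fresh_hits / n_fresh


train_acc, fresh_acc = chance_selection_demo()
print(f"best accuracy on the selection sample: {train_acc:.2f}")
print(f"accuracy on fresh observations:        {fresh_acc:.2f}")
```

With these settings the accuracy measured on the selection sample is typically far above 0.5, while the accuracy on fresh observations stays close to 0.5. The gap between the two is the optimistic bias that a sound assessment of performance guarantees has to account for.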
