Theoretical measures of relative performance of classifiers for high dimensional data with small sample sizes

We suggest a technique, related to the concept of ‘detection boundary’ that was developed by Ingster and by Donoho and Jin, for comparing the theoretical performance of classifiers constructed from small training samples of very large vectors. The resulting ‘classification boundaries’ are obtained for a variety of distance‐based methods, including the support vector machine, distance‐weighted discrimination and kth‐nearest‐neighbour classifiers, for thresholded forms of those methods, and for techniques based on Donoho and Jin's higher criticism approach to signal detection. Assessed in these terms, standard distance‐based methods are shown to be capable only of detecting differences between populations when those differences can be estimated consistently. However, the thresholded forms of distance‐based classifiers can do better, and in particular can correctly classify data even when differences between distributions are only detectable, not estimable. Other methods, including higher criticism classifiers, can on occasion perform better still, but they tend to be more limited in scope, requiring substantially more information about the marginal distributions. Moreover, as tail weight becomes heavier, the classification boundaries of methods designed for particular distribution types can converge to, and achieve, the boundary for thresholded nearest neighbour approaches. For example, although higher criticism has a lower classification boundary for normal data, and in this sense performs better, the boundaries are identical for exponentially distributed data when both sample sizes equal 1.
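To make the quantities discussed above concrete, the following minimal Python sketch computes Donoho and Jin's higher criticism statistic from a vector of p-values, together with a hard-thresholded squared distance of the general kind underlying thresholded distance-based classifiers. The function names, the restriction to the smallest half of the ordered p-values (alpha0 = 0.5), the numerical clipping constant and the particular hard-thresholding rule are illustrative assumptions, not the paper's exact constructions.

    import numpy as np

    def higher_criticism(pvalues, alpha0=0.5):
        """Donoho-Jin higher criticism statistic for a vector of p-values."""
        p = np.sort(np.asarray(pvalues, dtype=float))
        n = p.size
        p = np.clip(p, 1e-12, 1 - 1e-12)   # guard against division by zero
        i = np.arange(1, n + 1)
        # Standardised gap between the empirical and uniform distribution
        # functions at each order statistic.
        hc = np.sqrt(n) * (i / n - p) / np.sqrt(p * (1 - p))
        # Maximise over the smallest alpha0 fraction of the ordered p-values.
        k = max(1, int(alpha0 * n))
        return hc[:k].max()

    def thresholded_sq_distance(x, y, t):
        """Squared Euclidean distance after hard thresholding: componentwise
        differences smaller in magnitude than t are zeroed, so only large
        differences contribute to the distance."""
        d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
        d = np.where(np.abs(d) > t, d, 0.0)
        return float(np.sum(d ** 2))

A large value of the higher criticism statistic indicates a sparse departure from uniformity among the p-values; the clipping and the restriction to the smallest alpha0 fraction are standard devices for keeping the maximum well defined, and different authors use slightly different ranges.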

[1]  I. Ibragimov et al. Some Limit Theorems for Stationary Processes, 1962.

[2]  Fionn Murtagh et al. Multidimensional clustering algorithms, 1985.

[3]  Michael Wolf et al. Subsampling for heteroskedastic time series, 1997.

[4]  S. Altan et al. The analysis of small-sample multivariate data, 1998, Journal of Biopharmaceutical Statistics.

[5]  Christopher J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition, 1998.

[6]  Alexander J. Smola et al. Learning with kernels, 1998.

[7]  Z. Bai et al. Effect of high dimension: by an example of a two sample problem, 1999.

[8]  Vladimir N. Vapnik et al. The Nature of Statistical Learning Theory, 2000, Statistics for Engineering and Information Science.

[9]  D. Haussler et al. Knowledge-based analysis of microarray gene expression data by using support vector machines, 2000, Proceedings of the National Academy of Sciences of the United States of America.

[10]  I. Johnstone. On the distribution of the largest principal component, 2000.

[11]  I. Johnstone. On the distribution of the largest eigenvalue in principal components analysis, 2001.

[12]  Peter Rockett et al. The training of neural classifiers with condensed datasets, 2002, IEEE Trans. Syst. Man Cybern. Part B.

[13]  Dustin Boswell et al. Introduction to Support Vector Machines, 2002.

[14]  B. Efron. Large-Scale Simultaneous Hypothesis Testing, 2004.

[15]  S. Péché et al. Phase transition of the largest eigenvalue for nonnull complex sample covariance matrices, 2004, math/0403022.

[16]  Jiashun Jin et al. Detecting a target in very noisy data from multiple looks, 2004.

[17]  D. Donoho et al. Higher criticism for detecting sparse heterogeneous mixtures, 2004, math/0410072.

[18]  Christopher J. C. Burges et al. A Tutorial on Support Vector Machines for Pattern Recognition, 1998, Data Mining and Knowledge Discovery.

[19]  L. Cayon et al. Higher Criticism statistic: detecting and identifying non-Gaussianity in the WMAP first-year data, 2005.

[20]  J. S. Marron et al. Geometric representation of high dimension, low sample size data, 2005.

[21]  Higher Criticism Statistic: Theory and Applications in Non-Gaussian Detection, 2005.

[22]  Jean-Luc Starck et al. Cosmological Non-Gaussian Signature Detection: Comparing Performance of Different Statistical Tests, 2005, EURASIP J. Adv. Signal Process.

[23]  Noureddine El Karoui et al. Recent Results About the Largest Eigenvalue of Random Covariance Matrices and Statistical Application, 2005.

[24]  V. Vapnik. Estimation of Dependences Based on Empirical Data, 2006.

[25]  H. K. Eriksen et al. No Higher Criticism of the Bianchi Corrected WMAP Data, 2006, astro-ph/0602023.

[26]  H. K. Eriksen et al. No Higher Criticism of the Bianchi-corrected Wilkinson Microwave Anisotropy Probe data, 2006.

[27]  N. Meinshausen et al. Estimating the proportion of false null hypotheses among a large number of independently tested hypotheses, 2005, math/0501289.

[28]  M. Newton. Large-Scale Simultaneous Hypothesis Testing: The Choice of a Null Hypothesis, 2008.