Data complexity assessment in undersampled classification of high-dimensional biomedical data

Regularized linear classifiers have been applied successfully to undersampled biomedical classification problems, i.e. problems with small sample size and high dimensionality. In addition, data complexity measures have been proposed to assess the competence of a classifier in a particular context. Our work was motivated by Eldén's analysis of ill-posed regression problems and by the interpretation of linear discriminant analysis as a mean square error classifier. Using singular value decomposition (SVD) analysis, we define a discriminatory power spectrum and show that it provides a useful means of data complexity assessment for undersampled classification problems. On five real-life biomedical data sets of increasing difficulty, we demonstrate how the data complexity of a classification problem can be related to the performance of regularized linear classifiers. We show that the concentration of discriminatory power, as manifested in the discriminatory power spectrum, is a deciding factor for the success of regularized linear classifiers in undersampled classification problems. As a practical outcome, the proposed data complexity assessment may facilitate the choice of a classifier for a given undersampled problem.
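The abstract does not give the formula for the discriminatory power spectrum, so the following is only a minimal sketch of one plausible reading. It assumes the spectrum is the squared projection of the centered class-label vector onto each left singular vector of the centered data matrix, normalized to sum to one. The function name `discriminatory_power_spectrum`, the tolerance, and the synthetic data are all hypothetical and are not taken from the paper.

```python
import numpy as np


def discriminatory_power_spectrum(X, y, rel_tol=1e-10):
    """Hypothetical helper, not the paper's exact definition.

    Returns the retained singular values of the centered data matrix and
    the normalized squared projections of the centered labels onto the
    corresponding left singular vectors.
    """
    Xc = X - X.mean(axis=0)   # center each feature
    yc = y - y.mean()         # center the class labels
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    keep = s > s.max() * rel_tol      # drop numerically null directions
    proj = U[:, keep].T @ yc          # label component along each direction
    power = proj ** 2
    return s[keep], power / power.sum()


# Synthetic undersampled problem (assumed sizes): 20 samples, 500 features.
rng = np.random.default_rng(0)
n, p = 20, 500
y = np.repeat([-1.0, 1.0], n // 2)
w = rng.standard_normal(p)
w /= np.linalg.norm(w)
X = rng.standard_normal((n, p)) + 2.0 * np.outer(y, w)

s, power = discriminatory_power_spectrum(X, y)
# Share of discriminatory power carried by the three leading directions.
top3_share = power[:3].sum()
print(f"{len(power)} directions, top-3 share = {top3_share:.3f}")
```

Under this assumed definition, a spectrum whose mass sits in the first few directions would correspond to the concentrated case described above, while a flat spectrum would indicate a harder problem for regularized linear classifiers.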

[1] S. Wold et al. The Collinearity Problem in Linear Regression. The Partial Least Squares (PLS) Approach to Generalized Inverses, 1984.

[2] T. Subba Rao et al. Classification, Parameter Estimation and State Estimation: An Engineering Approach Using MATLAB, 2004.

[3] Vladimir N. Vapnik et al. The Nature of Statistical Learning Theory, 2000, Statistics for Engineering and Information Science.

[4] A. Phatak et al. The geometry of partial least squares, 1997.

[5] Tin Kam Ho et al. Complexity Measures of Supervised Classification Problems, 2002, IEEE Trans. Pattern Anal. Mach. Intell.

[6] R. Tibshirani et al. Prediction by Supervised Principal Components, 2006.

[7] I. Jolliffe. Principal Component Analysis, 2002.

[8] Geoffrey J. McLachlan et al. Selection bias in gene extraction on the basis of microarray gene-expression data, 2002, Proceedings of the National Academy of Sciences of the United States of America.

[9] David G. Stork et al. Pattern Classification, 1973.

[10] Sarunas Raudys et al. Comparison of Two Classification Methodologies on a Real-World Biomedical Problem, 2002, SSPR/SPR.

[11] Danh V. Nguyen et al. Multi-class cancer classification via partial least squares with gene expression profiles, 2002, Bioinform.

[12] Ray L. Somorjai et al. Accurate diagnosis and prognosis of human cancers by proton MRS and a three-stage classification strategy, 2002.

[13] Richard Baumgartner et al. Class prediction and discovery using gene microarray and proteomics mass spectroscopy data: curses, caveats, cautions, 2003, Bioinform.

[14] R. Somorjai et al. Rapid Identification of Candida Species by Using Nuclear Magnetic Resonance Spectroscopy and a Statistical Classification Strategy, 2003, Applied and Environmental Microbiology.

[15] Per Christian Hansen et al. Rank-Deficient and Discrete Ill-Posed Problems, 1996.

[16] Shigeo Abe. Pattern Classification, 2001, Springer London.

[17] Åke Björck. Acta Numerica 2004: The calculation of linear least squares problems, 2004.

[18] Lars Eldén et al. Partial least-squares vs. Lanczos bidiagonalization - I: analysis of a projection method for multiple regression, 2004, Comput. Stat. Data Anal.

[19] R. Tibshirani et al. Diagnosis of multiple cancer types by shrunken centroids of gene expression, 2002, Proceedings of the National Academy of Sciences of the United States of America.

[20] J. Mesirov et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring, 1999, Science.

[21] Jieping Ye et al. An optimization criterion for generalized discriminant analysis on undersampled problems, 2004, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[22] E. Wit. Design and Analysis of DNA Microarray Investigations, 2004, Human Genomics.