An insight on complexity measures and classification in microarray data

Microarray data classification has typically been seen as a difficult challenge for machine learning researchers, mainly because of its high feature dimensionality combined with a small sample size. However, this type of data presents other complications, such as overlapping between classes, dataset shift, class imbalance, non-linearity, or features extracted under extremely different distributions. This paper analyzes in depth the theoretical complexity of several popular binary datasets by means of complexity measures, and then connects it with the empirical results obtained by four widely used classifiers. Two situations are covered: datasets with only a training set and datasets originally divided into training and test sets. In both cases it is demonstrated that there is a correlation between the complexity measures and the actual error rates, which may help in deciding how to deal with a given dataset in the future. Finally, we present a case study on the Prostate dataset, improving the test classification accuracy from 53% to 97%.
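As a rough illustration of relating a complexity measure to an error rate, the sketch below computes the maximum Fisher's discriminant ratio (commonly called F1 in the data complexity literature) and correlates it with a leave-one-out 1-nearest-neighbor error. The choice of measure, the classifier, the synthetic Gaussian data, and all function names are assumptions made for this example; the abstract does not specify which measures or classifiers were used, and no real microarray data is involved.

```python
import numpy as np

def fisher_ratio_f1(X, y):
    """Maximum Fisher's discriminant ratio over features (labels must be 0/1)."""
    a, b = X[y == 0], X[y == 1]
    num = (a.mean(axis=0) - b.mean(axis=0)) ** 2
    den = a.var(axis=0) + b.var(axis=0)
    # Features with zero variance in both classes contribute a ratio of 0.
    ratios = np.divide(num, den, out=np.zeros_like(num), where=den > 0)
    return float(ratios.max())

def loo_1nn_error(X, y):
    """Leave-one-out error of a 1-nearest-neighbor classifier (Euclidean)."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)  # a sample may not be its own neighbor
    pred = y[dist.argmin(axis=1)]
    return float((pred != y).mean())

def make_toy(shift, n=40, d=10, seed=0):
    """Hypothetical toy data: two Gaussian classes separated on feature 0 only."""
    rng = np.random.default_rng(seed)
    y = np.repeat([0, 1], n // 2)
    X = rng.normal(size=(n, d))
    X[y == 1, 0] += shift
    return X, y

def demo(shifts=(0.0, 1.5, 3.0, 4.5, 6.0)):
    """Pearson correlation between F1 and 1-NN error across toy datasets."""
    f1, err = [], []
    for s in shifts:
        X, y = make_toy(s)
        f1.append(fisher_ratio_f1(X, y))
        err.append(loo_1nn_error(X, y))
    return float(np.corrcoef(f1, err)[0, 1])

if __name__ == "__main__":
    print("correlation between F1 and 1-NN error:", demo())
```

A higher F1 means that at least one feature separates the classes well, so on this toy data the correlation with the error rate should come out negative. Real microarray datasets have far more features than samples, so this only illustrates the mechanics, not the paper's results.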
