Microarray classification with hierarchical data representation and novel feature selection criteria

Microarray data classification is a challenging problem because the number of variables is very high compared to the small number of available samples. This work proposes a methodology for building a precise and reliable classifier, improving on the algorithm in [1]. It considers two issues typical of microarrays: sample scarcity and the lack of data structure. Both problems are addressed by a two-step approach: hierarchical clustering is applied to create new features called metagenes, and a novel feature ranking criterion is introduced within the wrapper feature selection task. Classification ability was evaluated on 4 publicly available datasets from the MicroArray Quality Control phase II study (MAQC-II), classified by 7 different endpoints. Overall, the results show that the proposed approach obtains better prediction accuracy than a wide variety of state-of-the-art alternatives.
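To illustrate the first step, here is a minimal sketch of how metagenes could be built by hierarchical clustering. The abstract does not specify the similarity measure or the merging rule. The sketch therefore assumes Pearson correlation as the similarity and the average of the two merged profiles as the new metagene. The names `pearson` and `build_metagenes` are hypothetical and do not come from the paper.

```python
from itertools import combinations
from statistics import mean


def pearson(a, b):
    """Pearson correlation between two equal-length expression profiles."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 0.0  # a constant profile carries no correlation information
    return cov / (va * vb) ** 0.5


def build_metagenes(genes):
    """Agglomerative clustering of features (assumed merging rule).

    genes: dict mapping gene name -> list of expression values per sample.
    Returns a dict holding the original genes plus one metagene per merge,
    so n genes yield 2n - 1 candidate features.
    """
    features = dict(genes)
    active = list(genes)  # current cluster roots
    next_id = 0
    while len(active) > 1:
        # Pick the two most correlated active features.
        a, b = max(
            combinations(active, 2),
            key=lambda pair: pearson(features[pair[0]], features[pair[1]]),
        )
        name = f"metagene_{next_id}"
        next_id += 1
        # Assumption: the metagene is the mean of the two merged profiles.
        features[name] = [(x + y) / 2 for x, y in zip(features[a], features[b])]
        active = [f for f in active if f not in (a, b)] + [name]
    return features


# Toy data: 4 genes measured on 4 samples (values are made up).
toy = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [1.1, 2.1, 2.9, 4.2],
    "g3": [4.0, 3.0, 2.0, 1.0],
    "g4": [2.0, 2.5, 1.0, 3.0],
}
enhanced = build_metagenes(toy)
```

The enlarged feature set (original genes plus metagenes) would then be passed to the wrapper feature selection step, which this sketch does not cover.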

[1] Youping Deng, et al. Feature Selection and Classification of MAQC-II Breast Cancer and Multiple Myeloma Microarray Gene Expression Data, 2009, PLoS ONE.

[2] K. Deb, et al. Reliable classification of two-class cancer data using evolutionary algorithms, 2003, Biosystems.

[3] Todd H. Stokes, et al. k-Nearest neighbor models for microarray gene expression analysis and clinical outcome prediction, 2010, The Pharmacogenomics Journal.

[4] Kevin C. Dorff, et al. The MicroArray Quality Control (MAQC)-II study of common practices for the development and validation of microarray-based predictive models, 2010, Nature Biotechnology.

[5] Josef Kittler, et al. Floating search methods in feature selection, 1994, Pattern Recognit. Lett.

[6] David G. Stork, et al. Pattern Classification, 1973.

[7] Ann B. Lee, et al. Treelets: An adaptive multi-scale basis for sparse unordered data, 2007, arXiv:0707.0481.

[8] R. Tibshirani, et al. Supervised harvesting of expression trees, 2001, Genome Biology.

[9] Edward R. Dougherty, et al. Performance of feature-selection methods in the classification of high-dimension data, 2009, Pattern Recognit.

[10] U. Braga-Neto, et al. Fads and fallacies in the name of small-sample microarray classification: A highlight of misunderstanding and erroneous usage in the applications of genomic signal processing, 2007, IEEE Signal Processing Magazine.

[11] Sandrine Dudoit, et al. Classification in microarray experiments, 2003.

[12] David Casasent, et al. An improvement on floating search algorithms for feature subset selection, 2009, Pattern Recognit.

[13] Peter Bühlmann, et al. Finding predictive gene groups from microarray data, 2004.

[14] Philippe Salembier, et al. Feature set enhancement via hierarchical clustering for microarray classification, 2011, IEEE International Workshop on Genomic Signal Processing and Statistics (GENSIPS).

[15] B. Matthews. Comparison of the predicted and observed secondary structure of T4 phage lysozyme, 1975, Biochimica et Biophysica Acta.

[16] Cheng Li, et al. A Survey of Classification Techniques for Microarray Data Analysis, 2011, Handbook of Statistical Bioinformatics.