Histogram based Hierarchical Data Representation for Microarray Classification

This work proposes a general framework for microarray classification that relies on histogram-based hierarchical clustering. It produces precise and reliable classifiers using a two-step approach. In the first step, the feature set is enhanced with histogram-based features, one for each cluster produced by hierarchical clustering; a parameter (the maximum number of dominant genes) can be tuned to the characteristics of the dataset. In the second step, a reliable classifier is built with a wrapper feature selection process, Improved Sequential Floating Forward Selection (IFFS), which chooses a small feature set for the classification task. Because samples are scarce in microarray datasets, a reliability parameter is considered alongside the classification error rate to improve the feature selection process, and different combinations of error rate and reliability have been used as the scoring rule. Linear Discriminant Analysis (LDA) and K-Nearest Neighbour (KNN) classifiers have been used, and their performance has been compared. The framework has been evaluated on three publicly available datasets: colon, lymphoma and leukaemia. The experimental results confirm the usefulness of the histogram-based hierarchical clustering and of the new representative feature generation algorithm. A gene-level analysis reveals that the best features chosen by the feature selection algorithm involve only a few constituent genes. The comparative results show that the proposed framework can compete with state-of-the-art alternatives.
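
The wrapper selection step can be illustrated with a minimal sketch. The code below implements plain sequential floating forward selection, not the improved variant (IFFS) used in this work, and it does not reproduce the authors' implementation. The names `sffs` and `combined_score`, the weight `alpha`, and the linear form of the scoring rule are all assumptions made for illustration; the abstract does not specify how error rate and reliability are combined.

```python
from typing import Callable, Dict, Hashable, List, Sequence, Tuple


def combined_score(error_rate: float, reliability: float, alpha: float = 0.5) -> float:
    """Hypothetical scoring rule: a weighted mix of accuracy and reliability.

    The linear form and the weight alpha are assumptions, not the paper's rule.
    Higher is better.
    """
    return (1.0 - alpha) * (1.0 - error_rate) + alpha * reliability


def sffs(
    features: Sequence[Hashable],
    score: Callable[[List[Hashable]], float],
    k_max: int,
) -> Dict[int, Tuple[float, List[Hashable]]]:
    """Plain sequential floating forward selection (higher score is better).

    Returns, for each subset size visited, the best (score, subset) found.
    """
    selected: List[Hashable] = []
    best: Dict[int, Tuple[float, List[Hashable]]] = {}

    while len(selected) < k_max:
        candidates = [f for f in features if f not in selected]
        if not candidates:
            break

        # Forward step: add the single feature that gives the best score.
        add = max(candidates, key=lambda f: score(selected + [f]))
        selected = selected + [add]
        s = score(selected)
        if len(selected) not in best or s > best[len(selected)][0]:
            best[len(selected)] = (s, list(selected))

        # Floating step: drop features while doing so beats the best
        # subset already recorded for the smaller size.
        while len(selected) > 2:
            drop = max(selected, key=lambda f: score([g for g in selected if g != f]))
            reduced = [g for g in selected if g != drop]
            s_reduced = score(reduced)
            if s_reduced > best[len(reduced)][0]:
                selected = reduced
                best[len(reduced)] = (s_reduced, list(reduced))
            else:
                break

    return best
```

In a wrapper setting, `score` would train a classifier such as LDA or KNN on the candidate subset and return something like `combined_score(error_rate, reliability)` estimated by cross-validation; here it is left as a caller-supplied function so that the search logic stays independent of the classifier.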
