Study of gene expression representation with Treelets and hierarchical clustering algorithms

Since the mid-1990s, the field of genomic signal processing has grown rapidly thanks to DNA microarray technology, which makes it possible to measure the mRNA expression of thousands of genes in parallel. Researchers have developed a vast body of knowledge on classification methods. However, microarray data are characterized by extremely high dimensionality and a comparatively small number of data points, which makes their analysis a distinctive problem. In this work we have developed several hierarchical clustering algorithms in order to improve the microarray classification task. First, the original feature set of gene expression values is enriched with new features that are linear combinations of the original ones. These new features, called metagenes, are produced by the different proposed hierarchical clustering algorithms. To demonstrate the utility of this methodology for classifying microarray datasets, a reliable classifier is built via a feature selection process. The methodology has been tested on three public cancer datasets: Colon, Leukemia and Lymphoma. The proposed method obtains better classification results than those obtained without this enhancement, confirming the utility of metagene generation for improving the final classifier. Second, a new technique has been developed that uses hierarchical clustering to reduce the huge microarray datasets by removing the initial genes that are not relevant for the cancer classification task. Experimental results for this method on one public database are also presented and analyzed, demonstrating the utility of this new approach.
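The idea of enriching the feature set with metagenes can be illustrated with a minimal sketch. It is not the algorithm developed in this work: it assumes a simple agglomerative rule (merge the two most correlated cluster representatives) and an equal-weight average as the linear combination, whereas Treelets, for instance, use a local PCA rotation at each merge. The function name `build_metagenes` and its parameters are hypothetical.

```python
import numpy as np

def build_metagenes(X, n_merges):
    """Append metagenes to an expression matrix X (samples x genes).

    Assumption for illustration only: at each step the two most
    correlated cluster representatives are merged, and the metagene
    is their equal-weight average.
    """
    X = np.asarray(X, dtype=float)
    # Each active entry is the representative profile of one cluster.
    active = [X[:, j] for j in range(X.shape[1])]
    metagenes = []
    for _ in range(n_merges):
        if len(active) < 2:
            break
        A = np.column_stack(active)
        C = np.corrcoef(A, rowvar=False)
        # Constant profiles yield NaN correlations; treat them as worst case.
        C = np.nan_to_num(C, nan=-1.0)
        np.fill_diagonal(C, -np.inf)  # never merge a cluster with itself
        i, j = np.unravel_index(np.argmax(C), C.shape)
        merged = 0.5 * (active[i] + active[j])  # linear combination
        metagenes.append(merged)
        # The merged profile replaces its two children in the active set.
        active = [f for k, f in enumerate(active) if k not in (i, j)]
        active.append(merged)
    if not metagenes:
        return X
    # Enriched feature set: original genes followed by the metagenes.
    return np.hstack([X, np.column_stack(metagenes)])
```

A feature selection step would then choose from the enriched matrix, so both original genes and metagenes compete for a place in the final classifier.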
