Hierarchical information representation and efficient classification of gene expression microarray data

In computational biology, microarrays are used to measure the activity of thousands of genes at once and to create a global picture of cellular function. They allow scientists to analyze the expression of many genes in a single experiment quickly and efficiently. Although microarrays are now a consolidated research technology and the trend in high-throughput data analysis is shifting towards newer technologies such as Next Generation Sequencing (NGS), an optimal method for sample classification has not yet been found. Microarray classification is a complicated task, not only because of the high dimensionality of the feature set, but also because of an apparent lack of data structure. This characteristic limits the applicability of processing techniques, such as wavelet filtering, that take advantage of known structural relations. On the other hand, it is well known that genes are not expressed independently of each other: genes are highly interdependent, and this interdependence is related to the regulating biological processes involved. This thesis aims to improve the current state of the art in microarray classification and to contribute to understanding how signal processing techniques can be developed and applied to analyze microarray data. Building a classification framework requires exploratory work in which algorithms are constantly tried and adapted to the analyzed data. The algorithms and classification frameworks developed in this thesis tackle the problem with two essential building blocks. The first deals with the lack of a priori structure by inferring a data-driven structure with unsupervised hierarchical clustering tools. The second is a proper feature selection tool that produces a precise classifier as output and reduces the risk of overfitting. The main focus of this thesis is binary data classification, a field in which we obtained relevant improvements to the state of the art.
The first key element is the data-driven structure, obtained by modifying hierarchical clustering algorithms derived from the Treelets algorithm in the literature. Several alternatives to the original reference algorithm have been tested, changing either the similarity metric used to choose which features to merge or the way two features are merged. The possibility of including external sources of information, drawn from publicly available biological knowledge and ontologies, to improve the structure generation has also been studied. For feature selection, two alternative approaches have been studied: the first is a modification of the IFFS algorithm used as a wrapper feature selection method, while the second follows an ensemble learning approach. To obtain good results, the IFFS algorithm has been adapted to the data characteristics by introducing new elements into the selection process, such as a reliability measure and a scoring system, to better select the best feature at each iteration. The second feature selection approach is based on ensemble learning and takes advantage of the abundance of features in microarrays to implement a different selection scheme. New algorithms have been studied in this field, adapting state-of-the-art algorithms to the characteristics of microarray data: small sample sizes and large numbers of features. In addition to the binary classification problem, the multiclass case has also been addressed. A new algorithm combining multiple binary classifiers has been evaluated, exploiting the redundancy offered by multiple classifiers to obtain better predictions. All the algorithms studied throughout this thesis have been evaluated using high-quality publicly available data, following established testing protocols from the literature to offer a proper benchmark against the state of the art. Whenever possible, multiple Monte Carlo simulations have been performed to increase the robustness of the obtained results.
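
To make the data-driven structure more concrete, the following is a minimal sketch of a Treelets-style merging loop, not the thesis implementation. It assumes absolute Pearson correlation as the similarity metric and a local two-variable PCA as the merge rule, which follows the general idea of the original Treelets algorithm; the modified metrics and merge rules studied in the thesis are not reproduced here. The function name `treelet_merges` and its parameters are hypothetical, and the sketch assumes no feature is constant.

```python
import numpy as np


def treelet_merges(X, n_merges):
    """Treelets-style bottom-up merging (illustrative sketch only).

    X        : array of shape (n_samples, n_features), e.g. expression values.
    n_merges : number of pairwise merges to perform.

    Returns the list of merged feature pairs (kept, retired) and the
    transformed data, where each kept column holds the local first principal
    component ("sum" variable) and each retired column holds the residual
    ("difference" variable).
    """
    Z = np.array(X, dtype=float)
    Z -= Z.mean(axis=0)                      # center every feature
    active = list(range(Z.shape[1]))         # features still eligible to merge
    merges = []

    for _ in range(n_merges):
        if len(active) < 2:
            break
        cov = np.cov(Z[:, active], rowvar=False)
        std = np.sqrt(np.diag(cov))          # assumes non-constant features
        sim = np.abs(cov / np.outer(std, std))
        np.fill_diagonal(sim, -1.0)          # never pair a feature with itself

        # Most similar pair among the active features.
        a, b = np.unravel_index(np.argmax(sim), sim.shape)
        i, j = active[a], active[b]

        # Local PCA on the 2x2 covariance of the selected pair.
        local_cov = cov[np.ix_([a, b], [a, b])]
        _, vecs = np.linalg.eigh(local_cov)  # eigenvalues in ascending order
        rotation = vecs[:, ::-1]             # principal direction first

        pair = Z[:, [i, j]] @ rotation
        Z[:, i] = pair[:, 0]                 # "sum" variable stays active
        Z[:, j] = pair[:, 1]                 # "difference" variable is retired
        active.remove(j)
        merges.append((i, j))

    return merges, Z
```

The sequence of merged pairs defines a binary tree over the features, and the "sum" variables at each level can serve as additional candidate features for a subsequent feature selection step.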
