Comparative Evaluation of Set-Level Techniques in Microarray Classification

Analysis of gene expression data in terms of a priori-defined gene sets typically yields more compact and interpretable results than those produced by traditional methods that rely on individual genes. The set-level strategy can also be adopted in predictive classification tasks accomplished with machine learning algorithms. Here, sample features originally corresponding to genes are replaced by a much smaller number of features, each corresponding to a gene set and aggregating expressions of its members into a single real value. Classifiers learned from such transformed features promise better interpretability in that they derive class predictions from overall expressions of selected gene sets (e.g. corresponding to pathways) rather than expressions of specific genes. In a large collection of experiments we test how accurate such classifiers are compared to traditional classifiers based on genes. Furthermore, we translate some recently published gene set analysis techniques to the above proposed machine learning setting and assess their contributions to the classification accuracies.

[1]  M. Daly,et al.  PGC-1α-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes , 2003, Nature Genetics.

[2]  Pooja Mittal,et al.  A novel signaling pathway impact analysis , 2009, Bioinform..

[3]  Wei Li Analyzing Gene Expression Data in Terms of Gene Sets: Gene Set Enrichment Analysis , 2009 .

[4]  M. Cotreau,et al.  Molecular classification of Crohn's disease and ulcerative colitis patients using transcriptional profiles in peripheral blood mononuclear cells. , 2006, The Journal of molecular diagnostics : JMD.

[5]  Eric R. Ziegel,et al.  The Elements of Statistical Learning , 2003, Technometrics.

[6]  Elias Zintzaras,et al.  Forest classification trees and forest support vector machines algorithms: Demonstration using microarray data , 2010, Comput. Biol. Medicine.

[7]  Pablo Tamayo,et al.  Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles , 2005, Proceedings of the National Academy of Sciences of the United States of America.

[8]  J. Growdon,et al.  Molecular markers of early Parkinson's disease based on gene expression in blood , 2007, Proceedings of the National Academy of Sciences.

[9]  Yumei Song,et al.  A Novel Signaling Pathway , 2009, The Journal of Biological Chemistry.

[10]  David E. Misek,et al.  Gene-expression profiles predict survival of patients with lung adenocarcinoma , 2002, Nature Medicine.

[11]  Janez Demsar,et al.  Statistical Comparisons of Classifiers over Multiple Data Sets , 2006, J. Mach. Learn. Res..

[12]  Shuichi Tsutsumi,et al.  Global gene expression analysis of gastric cancer by oligonucleotide microarrays. , 2002, Cancer research.

[13]  J. Mesirov,et al.  Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. , 1999, Science.

[14]  Vladimir Vapnik,et al.  The Nature of Statistical Learning , 1995 .

[15]  Peter Bühlmann,et al.  Analyzing gene expression data in terms of gene sets: methodological issues , 2007, Bioinform..

[16]  Brad T. Sherman,et al.  Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists , 2008, Nucleic acids research.

[17]  Ian Witten,et al.  Data Mining , 2000 .

[18]  Vladimir N. Vapnik,et al.  The Nature of Statistical Learning Theory , 2000, Statistics for Engineering and Information Science.

[19]  S. Horvath,et al.  Gene Expression Profiling of Gliomas Strongly Predicts Survival , 2004, Cancer Research.

[20]  E. Lander,et al.  MLL translocations specify a distinct gene expression profile that distinguishes a unique leukemia , 2002, Nature Genetics.

[21]  Yixin Wang,et al.  Novel Genes Associated with Malignant Melanoma but not Benign Melanocytic Lesions , 2005, Clinical Cancer Research.

[22]  Oliver Eulenstein,et al.  Bioinformatics Research and Applications , 2008 .

[23]  A. Heguy,et al.  Up-regulation of expression of the ubiquitin carboxyl-terminal hydrolase L1 gene in human airway epithelium of cigarette smokers. , 2006, Cancer research.

[24]  Hiroshi Motoda,et al.  Feature Selection for Knowledge Discovery and Data Mining , 1998, The Springer International Series in Engineering and Computer Science.

[25]  Hong Fang,et al.  Decision forest for classification of gene expression data , 2010, Comput. Biol. Medicine.

[26]  Qi Liu,et al.  Improving gene set analysis of microarray data by SAM-GS , 2007, BMC Bioinformatics.

[27]  P. Park,et al.  Angiogenic profile of soft tissue sarcomas based on analysis of circulating factors and microarray gene expression. , 2006, The Journal of surgical research.

[28]  K. Bretonnel Cohen,et al.  Ontology quality assurance through analysis of term transformations , 2009, Bioinform..

[29]  Filip Zelezný,et al.  Integrating Multiple-Platform Expression Data through Gene Set Features , 2009, ISBRA.

[30]  Jun Lu,et al.  Pathway level analysis of gene expression using singular value decomposition , 2005, BMC Bioinformatics.

[31]  Ian H. Witten,et al.  Data mining: practical machine learning tools and techniques, 3rd Edition , 1999 .