Gene set enrichment meta-learning analysis: next-generation sequencing versus microarrays

Background

Reproducibility of results can have a significant impact on the acceptance of new technologies in gene expression analysis. With the recent introduction of so-called next-generation sequencing (NGS) technology alongside established microarrays, one can choose between two completely different platforms for gene expression measurements. This study introduces a novel methodology for gene-ranking stability analysis and applies it to the evaluation of gene-ranking reproducibility on NGS and microarray data.

Results

This study uses the same data as the well-known MicroArray Quality Control (MAQC) study to compare ranked lists of genes from MAQC samples A and B, obtained from the Affymetrix HG-U133 Plus 2.0 and Roche 454 Genome Sequencer FLX platforms. An initial evaluation based on the percentage of overlapping genes demonstrates higher reproducibility on microarray data for 10 out of 11 gene-ranking methods. A gene set enrichment analysis shows similar enrichment of top gene sets when NGS is compared with microarrays at the pathway level. Our novel approach demonstrates high accuracy of decision trees when they are used for knowledge extraction from multiple bootstrapped gene set enrichment analysis runs. A comparison of the two approaches to sample preparation for high-throughput sequencing shows that alternating decision trees represent the optimal knowledge representation method in comparison with classical decision trees.

Conclusions

Usual reproducibility measurements are mostly based on statistical techniques that offer very limited biological insight into the studied gene expression data sets. This paper introduces meta-learning-based gene set enrichment analysis, which can be used to complement gene-ranking stability estimation techniques such as the percentage of overlapping genes or classic gene set enrichment analysis. It is useful and practical when the reproducibility of gene-ranking results or of different gene selection techniques is examined. The proposed method yields very accurate descriptive models that capture the co-enrichment of gene sets which are differently enriched in the compared data sets.
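
The percentage of overlapping genes mentioned above can be illustrated with a short sketch. It assumes the measure is the share of genes common to the top-k positions of two ranked lists, which may differ in detail from the definition used in the study. The function name `percentage_overlapping_genes` and the gene identifiers are hypothetical and serve only as an illustration.

```python
def percentage_overlapping_genes(ranking_a, ranking_b, k):
    """Percentage of genes shared by the top-k entries of two ranked lists.

    Assumption: overlap is counted on the top-k of each list and divided by k.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top_a = set(ranking_a[:k])
    top_b = set(ranking_b[:k])
    return 100.0 * len(top_a & top_b) / k


# Hypothetical rankings, e.g. one from microarray data and one from NGS data.
microarray_ranking = ["G1", "G2", "G3", "G4", "G5"]
ngs_ranking = ["G2", "G1", "G6", "G3", "G7"]

# Three of the top four genes (G1, G2, G3) appear in both lists.
pog = percentage_overlapping_genes(microarray_ranking, ngs_ranking, k=4)
print(pog)
```

Because the denominator is k, a list shorter than k lowers the score; a caller comparing lists of unequal length may prefer to cap k at the shorter length first.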
