Effective gene expression data generation framework based on multi-model approach

OBJECTIVE To overcome the shortage of samples in gene expression data sets, which contain thousands of genes but only a small number of samples, a limitation that challenges the computational methods that use them.

METHODS AND MATERIAL This paper introduces a multi-model artificial gene expression data generation framework in which different gene regulatory network (GRN) models contribute to the final set of samples according to the characteristics of their underlying paradigms. First, we build different GRN models and sample data from each of them separately. Then, we pool the generated samples into a rich set of gene expression samples. Finally, we select the best of the generated samples using a multi-objective selection method that measures sample quality from three aspects: compatibility, diversity and coverage. We use four alternative GRN models: ordinary differential equations, probabilistic Boolean networks, multi-objective genetic algorithm and hierarchical Markov model.

RESULTS We conducted a comprehensive set of experiments on both real-life biological and synthetic gene expression data sets. Our multi-objective sample selection mechanism effectively combines samples from different models, reaching up to 95% compatibility, 10% diversity and 50% coverage. The samples generated by our framework have up to 1.5x higher compatibility, 2x higher diversity and 2x higher coverage than the samples generated by the individual models that the multi-model framework uses. Moreover, the GRNs inferred from the samples generated by our framework can have 2.4x higher precision, 12x higher recall and 5.4x higher f-measure values than the GRNs inferred from the original gene expression samples.

CONCLUSIONS We show that the quality of generated gene expression samples can be significantly improved by integrating different computational models into one unified framework, without dealing with the complex internal details of each individual model. Moreover, the rich set of artificial gene expression samples can capture some biological relations that are not captured even by the original gene expression data set.
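The precision, recall and f-measure values reported for inferred GRNs can be illustrated with a minimal sketch. It assumes that a GRN is represented as a set of directed (regulator, target) edges and that an inferred edge counts as correct only if it appears in a reference network with the same direction; the abstract does not state the exact evaluation protocol, and the function name `edge_scores` and the gene labels are hypothetical.

```python
def edge_scores(reference_edges, inferred_edges):
    """Precision, recall and F-measure of inferred edges against a reference.

    Assumption: edges are directed (regulator, target) pairs, and direction
    must match for an edge to count as a true positive.
    """
    reference = set(reference_edges)
    inferred = set(inferred_edges)
    true_positives = len(reference & inferred)

    # Guard against empty sets to avoid division by zero.
    precision = true_positives / len(inferred) if inferred else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    if precision + recall == 0:
        f_measure = 0.0
    else:
        # Harmonic mean of precision and recall.
        f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical toy networks, for illustration only.
reference = [("g1", "g2"), ("g2", "g3"), ("g3", "g4"), ("g1", "g4")]
inferred = [("g1", "g2"), ("g2", "g3"), ("g4", "g1")]

precision, recall, f_measure = edge_scores(reference, inferred)
print(f"precision={precision:.3f} recall={recall:.3f} f-measure={f_measure:.3f}")
```

In this toy case two of the three inferred edges match the reference, and two of the four reference edges are recovered; the reversed edge ("g4", "g1") is counted as wrong under the direction-sensitive assumption.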
