Integrating machine learning techniques into robust data enrichment approach and its application to gene expression data

The availability of sufficient samples for effective analysis and knowledge discovery has long been a challenge in the research community, especially in gene expression data analysis. As a result, the approaches developed for data analysis have mostly suffered from a lack of data for training and testing the constructed models. We argue that sample generation can be automated by employing sophisticated machine learning techniques, and that an automated sample generation framework can complement the collection of samples from real cases. We validate this argument by describing a framework that integrates multiple models (perspectives) for sample generation. We illustrate its applicability by producing new gene expression data samples, an area of high demand that has received little attention. The three perspectives employed in the process are based on models that are not closely related. This independence avoids the bias of a framework that covers only certain characteristics of the domain and therefore produces samples skewed in one direction. The first model is based on the Probabilistic Boolean Network (PBN) representation of the gene regulatory network underlying the given gene expression data. The second model integrates a Hierarchical Markov Model (HIMM), and the third model employs a genetic algorithm. Each model learns as many characteristics of the domain under analysis as possible and incorporates the learned characteristics when generating new samples. In other words, the models base their analysis on domain knowledge implicitly present in the data itself. The framework has been extensively tested by checking how well the new samples complement the original samples. The results are very promising and demonstrate the effectiveness, usefulness and applicability of the proposed multi-model framework.
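
As a rough illustration of the third perspective, the sketch below derives new samples from existing ones using uniform crossover and bit-flip mutation, two common variation operators of a genetic algorithm. It is not the authors' implementation: the abstract does not specify the operators, the encoding, or the fitness function, so the binarized expression values, the function names, the mutation rate, and the omission of fitness-based selection are all assumptions made for this example.

```python
import random


def uniform_crossover(parent_a, parent_b, rng):
    """Build a child by taking each gene from either parent with equal probability."""
    return [a if rng.random() < 0.5 else b for a, b in zip(parent_a, parent_b)]


def mutate(sample, rate, rng):
    """Flip each binary gene value independently with probability `rate`."""
    return [1 - g if rng.random() < rate else g for g in sample]


def generate_samples(samples, n_new, mutation_rate=0.01, seed=0):
    """Generate `n_new` synthetic samples from pairs of existing samples.

    Hypothetical helper: a full genetic algorithm would also score
    candidates with a fitness function and select among them; that
    step is omitted here.
    """
    rng = random.Random(seed)
    new_samples = []
    for _ in range(n_new):
        # Pick two distinct existing samples as parents.
        parent_a, parent_b = rng.sample(samples, 2)
        child = uniform_crossover(parent_a, parent_b, rng)
        new_samples.append(mutate(child, mutation_rate, rng))
    return new_samples


# Toy data (assumed): 3 samples x 5 genes, expression binarized to 0/1.
original = [
    [1, 0, 1, 1, 0],
    [0, 0, 1, 0, 1],
    [1, 1, 0, 1, 0],
]
synthetic = generate_samples(original, n_new=4)
```

With the mutation rate set to zero, every gene value in a generated sample is inherited from one of its parents, so the new samples stay within the per-gene values observed in the original data; mutation is what introduces values not seen at a given position.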
