Causal inference and prior integration in bioinformatics using information theory

An important problem in bioinformatics is the reconstruction of gene regulatory networks from expression data. The analysis of genomic data stemming from high-throughput technologies such as microarray experiments or RNA sequencing faces several difficulties. The first major issue is the high variable-to-sample ratio: a single experiment captures all genes, while the number of experiments is restricted by the experiment's cost, time and patient cohort size. The second problem is that these data sets typically exhibit high amounts of noise.

Another important problem in bioinformatics is the question of how the quality of the inferred networks can be evaluated. The current best practice is a two-step procedure. In the first step, the highest-scoring interactions are compared to known interactions stored in biological databases. The inferred network passes this quality assessment if there is a large overlap with the known interactions. In this case, a second step is carried out in which unknown but high-scoring, and thus promising, new interactions are validated 'by hand' via laboratory experiments. Unfortunately, when prior knowledge is integrated into the inference procedure, this validation procedure becomes biased because the same information is used in both the inference and the validation. It therefore no longer allows an independent validation of the resulting network.

The main contribution of this thesis is a complete computational framework that uses experimental knock-down data in a cross-validation scheme to both infer and validate directed networks. Its components are i) a method that integrates genomic data and prior knowledge to infer directed networks, ii) its implementation in an R/Bioconductor package and iii) a web application to retrieve prior knowledge from PubMed abstracts and biological databases.

To infer directed networks from genomic data and prior knowledge, we propose a two-step procedure. First, we adapt the pairwise feature selection strategy mRMR to integrate prior knowledge in order to obtain the network's skeleton. Then, for the subsequent orientation phase of the algorithm, we extend a criterion based on interaction information to include prior knowledge. The implementation of this method is available both as part of the prior retrieval tool Predictive Networks and as a stand-alone R/Bioconductor package named predictionet.

Furthermore, we propose a fully data-driven quantitative validation of such directed networks using experimental knock-down data. We start by identifying the set of genes that was truly affected by the perturbation experiment. The rationale of our validation procedure is that these truly affected genes should also be part of the perturbed gene's childhood in the inferred network. Consequently, we can compute a performance score
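
As a rough illustration of the information-theoretic quantity behind the orientation phase, the following Python sketch computes interaction information for three discrete variables from empirical entropies. It is a toy example, not the criterion implemented in predictionet: the sign convention I(X;Y|Z) - I(X;Y) is an assumption (the opposite sign is also used in the literature), the function names are hypothetical, and the prior-knowledge extension described above is not modelled. Under this convention, a positive value is consistent with a collider X -> Z <- Y, while a negative value is consistent with Z mediating the dependence between X and Y.

```python
from collections import Counter
from itertools import product
from math import log2


def entropy(samples):
    """Empirical Shannon entropy (in bits) of a list of hashable outcomes."""
    counts = Counter(samples)
    n = len(samples)
    return -sum(c / n * log2(c / n) for c in counts.values())


def mutual_information(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))


def conditional_mutual_information(x, y, z):
    """I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z)."""
    return (entropy(list(zip(x, z))) + entropy(list(zip(y, z)))
            - entropy(list(zip(x, y, z))) - entropy(z))


def interaction_information(x, y, z):
    """I(X;Y|Z) - I(X;Y); the sign convention is an assumption of this sketch."""
    return conditional_mutual_information(x, y, z) - mutual_information(x, y)


# Collider toy data: X and Y are independent bits and Z = X xor Y.
# Each (x, y) pair is listed once, so the empirical frequencies are exact.
pairs = list(product([0, 1], repeat=2))
x = [a for a, _ in pairs]
y = [b for _, b in pairs]
z = [a ^ b for a, b in pairs]
collider_score = interaction_information(x, y, z)

# Chain toy data: X -> Z -> Y with noiseless copies, so Z explains away
# all of the dependence between X and Y.
chain_x = [0, 1]
chain_z = list(chain_x)
chain_y = list(chain_z)
chain_score = interaction_information(chain_x, chain_y, chain_z)

print(f"collider: {collider_score:+.1f} bit, chain: {chain_score:+.1f} bit")
```

On real expression data the entropies would have to be estimated from few, noisy samples, typically after discretization or under a Gaussian assumption, so the estimated sign is far less clear-cut than in this noiseless example.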
