Multi-model and network inference based on ensemble estimates: avoiding the madness of crowds

Recent progress in theoretical systems biology, applied mathematics and computational statistics allows us to compare quantitatively how well different candidate models describe a particular biological system. Model selection has been applied with great success to problems where a small number of models (typically fewer than 10) are compared, but recent studies have started to consider thousands and even millions of candidate models. Often, however, we are left with sets of models that are all compatible with the data, and we can then use ensembles of models to make predictions. These ensembles can have very desirable characteristics but, as I show here, they are not guaranteed to improve on individual estimators or predictors. I will show, for model selection and for network inference, when we can trust ensembles and when we should be cautious. The analyses suggest that the careful construction of an ensemble, that is, choosing good predictors, is of paramount importance, more than had perhaps been realised before: merely adding different methods does not suffice. The success of ensemble network inference methods is also shown to rest on their ability to suppress false-positive results. A Jupyter notebook for assessing ensemble estimators is provided.
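
The claim that adding predictors does not by itself improve an ensemble can be illustrated with a minimal sketch. It is not taken from the notebook mentioned in the abstract. It assumes binary predictions, statistically independent predictors, and a simple unweighted majority vote. The function name `majority_vote_accuracy` and the accuracy values are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def majority_vote_accuracy(accuracies: Sequence[float]) -> float:
    """Exact probability that an unweighted majority vote is correct.

    Assumes (for illustration only) binary outcomes and independent
    predictors, where accuracies[i] is the probability that predictor i
    is correct.
    """
    n = len(accuracies)
    if n % 2 == 0:
        raise ValueError("use an odd number of predictors to avoid ties")

    # dist[k] = probability that exactly k of the predictors processed
    # so far are correct (a Poisson-binomial distribution).
    dist = [1.0]
    for p in accuracies:
        updated = [0.0] * (len(dist) + 1)
        for k, prob in enumerate(dist):
            updated[k] += prob * (1.0 - p)   # this predictor is wrong
            updated[k + 1] += prob * p       # this predictor is correct
        dist = updated

    # The vote is correct when more than half of the predictors are correct.
    return sum(dist[n // 2 + 1:])


good_only = majority_vote_accuracy([0.9, 0.9, 0.9])
with_random = majority_vote_accuracy([0.9, 0.9, 0.9, 0.5, 0.5])
poor_only = majority_vote_accuracy([0.4, 0.4, 0.4])

print(f"three good predictors:           {good_only:.3f}")
print(f"plus two coin-flip predictors:   {with_random:.3f}")
print(f"three worse-than-chance members: {poor_only:.3f}")
```

Under these assumptions, three predictors with accuracy 0.9 give an ensemble accuracy of 0.972, adding two coin-flip predictors lowers it to 0.918, and an ensemble of three predictors with accuracy 0.4 reaches only 0.352, worse than any single member. Correlated predictors, which are the norm in practice, would change these numbers.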
