Centralizing data to unlock whole-cell models

Despite their substantial potential to transform bioscience, medicine, and bioengineering, whole-cell models remain elusive. One of the biggest obstacles to building them is assembling the large and diverse array of data needed to model an entire cell. Thanks to rapid advances in experimentation, much of the necessary data is becoming available. Furthermore, investigators are increasingly sharing their data, reflecting growing recognition of the importance of transparent and reproducible research. However, the scattered organization of this data continues to hamper modeling. Toward more predictive models, we highlight the challenges of assembling the data needed for whole-cell modeling and outline how we can overcome these challenges by working together to build a central data warehouse.
