Structured spreadsheets with ObjTables enable data reuse and integration

A central challenge in science is to understand how behaviors emerge from complex networks. Meeting this challenge often requires reusing and integrating heterogeneous information. Spreadsheets published as supplements to journal articles are a key source of such data, and they are popular because they are easy to read and write. However, spreadsheets are often difficult to reanalyze because they capture data ad hoc, without schemas that define the objects, relationships, and attributes that they represent. To help researchers reuse and compose spreadsheets, we developed ObjTables, a toolkit that structures human-readable spreadsheets with schemas. ObjTables includes a format for schemas; a markup language for indicating the class represented by each spreadsheet and the attribute represented by each column; and software that uses schemas to read, write, validate, compare, merge, split, version, and analyze spreadsheets. ObjTables supports a wide range of data types. By making spreadsheets easier to reuse, ObjTables could enable secondary meta-analyses that are currently impractical. By making it easy to build new formats and associated software for new types of data, ObjTables could also accelerate emerging scientific fields.
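To make the idea of a schema-backed spreadsheet concrete, the following is a minimal sketch in plain Python. It does not use the ObjTables software or its actual markup syntax. The `Attribute` class, the `SCHEMA` dictionary, the `class=Metabolite` header line, and the `read_table` function are all hypothetical names chosen for illustration. The sketch assumes one table per class, with a first line naming the class and a second line naming the attribute held in each column.

```python
import csv
import io
from dataclasses import dataclass


@dataclass(frozen=True)
class Attribute:
    """One column of a table: its name, Python type, and whether it is required."""
    name: str
    type: type
    required: bool = True


# Hypothetical schema: one class ("Metabolite") and the attributes of its columns.
SCHEMA = {
    "Metabolite": [
        Attribute("Id", str),
        Attribute("Name", str),
        Attribute("Charge", int, required=False),
    ]
}


def read_table(text, schema):
    """Parse and validate one table against the schema.

    Assumed layout (illustrative, not the ObjTables markup):
    line 1 is 'class=<ClassName>', line 2 holds the column headings.
    Returns the class name, the parsed rows, and a list of error messages.
    """
    lines = text.strip().splitlines()
    class_name = lines[0].split("=", 1)[1].strip()
    attributes = schema[class_name]

    reader = csv.DictReader(io.StringIO("\n".join(lines[1:])))
    objects, errors = [], []
    for row_num, row in enumerate(reader, start=1):
        obj = {}
        for attr in attributes:
            raw = (row.get(attr.name) or "").strip()
            obj[attr.name] = None
            if not raw:
                if attr.required:
                    errors.append(f"row {row_num}: missing {attr.name}")
                continue
            try:
                obj[attr.name] = attr.type(raw)
            except ValueError:
                errors.append(
                    f"row {row_num}: {attr.name}={raw!r} is not {attr.type.__name__}"
                )
        objects.append(obj)
    return class_name, objects, errors


# Example table with one valid row pair and one row containing two problems.
TABLE = """class=Metabolite
Id,Name,Charge
atp,ATP,-4
h2o,Water,0
,Glucose,zero
"""

class_name, objects, errors = read_table(TABLE, SCHEMA)
for message in errors:
    print(message)
```

Because the schema is declared separately from the data, the same declaration can drive reading, type conversion, and validation, which is the property that makes a table reusable by someone other than its author.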
