Data Lakes, Clouds, and Commons: A Review of Platforms for Analyzing and Sharing Genomic Data

Data commons collate data with cloud computing infrastructure and commonly used software services, tools, and applications to create biomedical resources for the large-scale management, analysis, harmonization, and sharing of biomedical data. Over the past few years, data commons have been used to analyze, harmonize, and share large-scale genomics datasets. Data ecosystems can be built by interoperating multiple data commons. Curating, importing, and analyzing the data in a data commons can be labor intensive. Data lakes offer an alternative: they simply provide access to the data, deferring curation and analysis until later and delegating them to those who access the data. We review software platforms for managing, analyzing, and sharing genomic data, with an emphasis on data commons, and also cover data ecosystems and data lakes.
