META-BASE: A Novel Architecture for Large-Scale Genomic Metadata Integration

The integration of genomic metadata is at once an important, difficult, and well-recognized challenge. It is important because a wealth of public data repositories is available to drive biological and clinical research, and combining information from heterogeneous and widely dispersed sources is paramount to a number of biological discoveries. It is difficult because the domain is complex and there is no agreement among the various metadata definitions, which refer to different vocabularies and ontologies. It is well recognized in the bioinformatics community because, in common practice, repositories are accessed one by one, and their specific metadata definitions are learned only through long and tedious effort; this practice is error-prone. In this paper, we describe META-BASE, an architecture for integrating metadata extracted from a variety of genomic data sources, based upon a structured transformation process. We present a variety of innovative techniques for data extraction, cleaning, normalization, and enrichment. We propose a general, open, and extensible pipeline that can easily incorporate any number of new data sources, and we present the resulting repository (already integrating several important sources), which is exposed by means of practical user interfaces to respond to biological researchers' needs.
