Metadata Extraction and Management in Data Lakes With GEMMS

In addition to volume and velocity, big data is characterized by its variety. Variety in structure and semantics requires new integration approaches that also scale to large volumes of data. Data lakes aim to reduce upfront integration costs and provide a more flexible way to integrate and analyze data, because source data is loaded into the data lake repository in its original structure. Some syntactic transformation might be applied to make the data accessible in one common repository; deep semantic integration, however, is done only after the initial loading of the data into the data lake. The data is thereby made available easily and can be restructured, aggregated, and transformed as required by later applications. Metadata management is a crucial component of a data lake, as the source data needs to be described by metadata to capture its semantics. We developed a Generic and Extensible Metadata Management System for data lakes (called GEMMS) that aims at the automatic extraction of metadata from a wide variety of data sources. Furthermore, the metadata is managed in an extensible metamodel that distinguishes structural and semantic metadata. The use case used for evaluation comes from the life science domain, where data is often stored only in files, which hinders data access and efficient querying. The GEMMS framework has proven useful in this domain. In particular, the extensibility and flexibility of the framework are important, as data and metadata structures in scientific experiments cannot be defined a priori.
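
To make the distinction between structural and semantic metadata more concrete, the following Python sketch shows one way an extensible extraction layer could be organized. It is not the GEMMS implementation or its API: the class names (`StructuralMetadata`, `SemanticMetadata`, `MetadataRecord`), the `EXTRACTORS` registry, and the sample CSV content are all assumptions made for illustration. The idea is only that each file type gets its own pluggable extractor, structural metadata is derived automatically at load time, and semantic annotations can be attached later.

```python
import csv
import io
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class StructuralMetadata:
    """How the data is laid out (here only column names and a row count)."""
    columns: List[str]
    row_count: int


@dataclass
class SemanticMetadata:
    """What the data means: free-form annotations keyed by column name."""
    annotations: Dict[str, str] = field(default_factory=dict)


@dataclass
class MetadataRecord:
    source_name: str
    structural: StructuralMetadata
    semantic: SemanticMetadata


# Hypothetical registry mapping a media type to an extractor function.
Extractor = Callable[[str, str], MetadataRecord]
EXTRACTORS: Dict[str, Extractor] = {}


def register(media_type: str):
    """Decorator that adds an extractor to the registry (the extension point)."""
    def decorator(func: Extractor) -> Extractor:
        EXTRACTORS[media_type] = func
        return func
    return decorator


@register("text/csv")
def extract_csv(source_name: str, content: str) -> MetadataRecord:
    """Derive structural metadata from CSV text; semantics start out empty."""
    rows = list(csv.reader(io.StringIO(content)))
    header = rows[0] if rows else []
    structural = StructuralMetadata(columns=header,
                                    row_count=max(len(rows) - 1, 0))
    return MetadataRecord(source_name, structural, SemanticMetadata())


def extract(source_name: str, media_type: str,
            content: str) -> Optional[MetadataRecord]:
    """Dispatch to the registered extractor, or return None if there is none."""
    extractor = EXTRACTORS.get(media_type)
    return extractor(source_name, content) if extractor else None


# Made-up sample data: structure is extracted on load, meaning is added later.
record = extract("assay.csv", "text/csv", "compound,ic50_nm\nA,12.5\nB,40.0\n")
record.semantic.annotations["ic50_nm"] = (
    "half-maximal inhibitory concentration in nanomolar"
)
```

Supporting a further file format in this sketch means registering one more function, which mirrors the requirement stated above that data and metadata structures cannot be fixed in advance.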
