HANDLE - A Generic Metadata Model for Data Lakes

The substantial increase in generated data has led to the development of new concepts such as the data lake. A data lake is a large storage repository designed to enable flexible extraction of the data’s value. A key aspect of exploiting data value in data lakes is the collection and management of metadata. Storing and handling this metadata requires a generic metadata model that can reflect the metadata of any potential metadata management use case, e.g., data versioning or data lineage. However, an evaluation of existing metadata models shows that none so far is sufficiently generic. In this work, we present HANDLE, a generic metadata model for data lakes, which supports the flexible integration of metadata, data lake zones, metadata on various granular levels, and any metadata categorization. With these capabilities, HANDLE enables comprehensive metadata management in data lakes. We show HANDLE’s feasibility through its application to an exemplary access use case and a prototypical implementation. A comparison with existing models shows that HANDLE can reflect the same information and provides additional capabilities needed for metadata management in data lakes.
