Leveraging the Data Lake: Current State and Challenges

Digital transformation produces massive amounts of heterogeneous data that challenge traditional data warehouse solutions in enterprises. To exploit this complex data for competitive advantage, the data lake has recently emerged as a concept for more flexible and powerful data analytics. However, the existing literature on data lakes is rather vague and incomplete, and the various realization approaches that have been proposed neither cover all aspects of data lakes nor provide a comprehensive design and realization strategy. Hence, enterprises face multiple challenges when building data lakes. To address these shortcomings, we investigate the existing data lake literature and discuss various design and realization aspects of data lakes, such as governance and data models. Based on these insights, we identify challenges and research gaps concerning (1) data lake architecture, (2) data lake governance, and (3) a comprehensive strategy to realize data lakes. These challenges still need to be addressed before data lakes can be leveraged successfully in practice.
