Data lake concept and systems: a survey

Although big data has been discussed for some years, it still poses many research challenges, especially regarding the variety of data. Efficiently integrating, accessing, and querying large volumes of diverse data held in information silos is very difficult with traditional ‘schema-on-write’ approaches such as data warehouses. Data lakes have been proposed as a solution to this problem. They are repositories that store raw data in its original formats and provide a common access interface. This survey reviews the development, definition, and architectures of data lakes. We provide a comprehensive overview of research questions for designing and building data lakes. We classify existing data lake systems by the functions they provide, which makes this survey a useful technical reference for designing, implementing, and applying data lakes. We hope that the thorough comparison of existing solutions and the discussion of open research challenges in this survey will motivate the future development of data lake research and practice.
