Data Lake Management: Challenges and Opportunities

The ubiquity of data lakes has created fascinating new challenges for data management research. In this tutorial, we review the state of the art in data management for data lakes. We consider how data lakes introduce new problems, including dataset discovery, and how they change the requirements for classic problems, including data extraction, data cleaning, data integration, data versioning, and metadata management.

PVLDB Reference Format: Fatemeh Nargesian, Erkang Zhu, Renée J. Miller, Ken Q. Pu, Patricia C. Arocena. Data Lake Management: Challenges and Opportunities. PVLDB, 12(12): 1986-1989, 2019. DOI: https://doi.org/10.14778/3352063.3352116
