Analysis-oriented Metadata for Data Lakes

Data lakes are meant to let analysts work more efficiently and effectively by combining multiple existing data sources, processes and analyses. This is not achievable, however, when a data lake lacks a metadata governance system that progressively capitalizes on all the analysis experiments performed. The objective of this paper is to make a data lake easily accessible and reusable by capitalizing on all user experiences. To meet this need, we propose an analysis-oriented metadata model for data lakes. The model covers descriptive information about datasets and their attributes, as well as all metadata related to the machine learning analyses performed on these datasets. To illustrate our metadata solution, we implemented a data lake metadata management application. This application allows users to find and use existing data, processes and analyses by searching relevant metadata stored in a NoSQL data store within the data lake. To demonstrate how easily metadata can be discovered with the application, we present two use cases with real data: dataset similarity detection and machine learning guidance.
