A Zone-Based Data Lake Architecture for IoT, Small and Big Data

Data lakes are meant to let analysts perform more efficient and effective data analysis by combining multiple existing data sources, processes and analyses. However, this is impossible when a data lake lacks a metadata governance system that progressively capitalizes on all the analysis experiments performed. The objective of this paper is to provide an easily accessible, reusable data lake that capitalizes on all user experiences. To meet this need, we propose an analysis-oriented metadata model for data lakes. This model includes descriptive information about datasets and their attributes, as well as all metadata related to the machine learning analyses performed on these datasets. To illustrate our metadata solution, we implemented a web application for data lake metadata management. This application allows users to find and reuse existing data, processes and analyses by searching the relevant metadata stored in a NoSQL data store within the data lake. To demonstrate how easily metadata can be discovered with the application, we present two use cases with real data: dataset similarity detection and machine learning guidance.
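
As a rough illustration of what an analysis-oriented metadata record and a dataset similarity check could look like, the following minimal sketch assumes a simplified structure. The class names, fields, and the choice of Jaccard similarity over attribute names are all hypothetical and are not taken from the paper's actual model or application.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AttributeMetadata:
    # Hypothetical per-attribute description
    name: str
    dtype: str


@dataclass
class AnalysisMetadata:
    # Hypothetical record of one machine learning analysis run on a dataset
    task: str
    algorithm: str
    target: str
    score: float


@dataclass
class DatasetMetadata:
    # Hypothetical dataset-level record linking attributes and analyses
    name: str
    description: str
    attributes: List[AttributeMetadata] = field(default_factory=list)
    analyses: List[AnalysisMetadata] = field(default_factory=list)


def attribute_similarity(a: DatasetMetadata, b: DatasetMetadata) -> float:
    """Jaccard similarity of the two datasets' attribute-name sets (an assumed measure)."""
    names_a = {attr.name.lower() for attr in a.attributes}
    names_b = {attr.name.lower() for attr in b.attributes}
    union = names_a | names_b
    return len(names_a & names_b) / len(union) if union else 0.0


def search(catalog: List[DatasetMetadata], keyword: str) -> List[DatasetMetadata]:
    """Return datasets whose name, description, or attribute names contain the keyword."""
    kw = keyword.lower()
    return [
        d for d in catalog
        if kw in d.name.lower()
        or kw in d.description.lower()
        or any(kw in attr.name.lower() for attr in d.attributes)
    ]
```

In such a sketch, a user looking for prior work could call `search(catalog, "glucose")` to find candidate datasets, then inspect each result's `analyses` list to see which algorithms were already tried on it.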
