Data Lake Governance: Towards a Systemic and Natural Ecosystem Analogy

The realm of big data has opened new avenues for knowledge acquisition, but it has also raised major challenges, including data interoperability and effective management. The sheer volume of heterogeneous data makes generating new knowledge a complex data analysis process. Big data technologies currently offer many solutions and tools for the semantic analysis of heterogeneous data, including support for its accessibility and reusability. However, beyond learning from data, we face the problem of storing and managing data in a cost-effective and reliable manner. This is the core topic of this paper. A data lake, inspired by the natural lake, is a centralized data repository that stores all kinds of data in any format and structure, so any type of data can be ingested without restriction or normalization. This openness can lead to a critical problem known as a data swamp: a lake that contains invalid or incoherent data that adds no value for further knowledge acquisition. To deal with the potential avalanche of data, governing rules are required to turn such heterogeneous datasets into manageable data. In this article, we address this problem and propose innovative methods, derived from a multidisciplinary science perspective, for managing data lakes. The proposed methods imitate supply chain management and natural lake principles, with an emphasis on the importance of the data life cycle, to implement responsible data governance for the data lake.
