Modeling Data Lakes with Data Vault: Practical Experiences, Assessment, and Lessons Learned

Data lakes have become popular to enable organization-wide analytics on heterogeneous data from multiple sources. Data lakes store data in their raw format and are often characterized as schema-free. Nevertheless, it has turned out that data still need to be modeled, as neglecting data modeling may lead to issues concerning, e.g., data quality and integration. In current research literature and industry practice, Data Vault is a popular modeling technique for structured data in data lakes. It promises a flexible, extensible data model that preserves data in their raw format. However, hardly any research or assessment exists on the practical usage of Data Vault for modeling data lakes. In this paper, we assess the Data Vault model’s suitability for the data lake context, present lessons learned, and investigate success factors for the use of Data Vault. Our discussion is based on the practical usage of Data Vault in a large, global manufacturer’s data lake and the insights gained in real-world analytics projects.
