Data Wrangling: The Challenging Journey from the Wild to the Lake

Much has been written about the explosion of data, also known as the “data deluge”. Similarly, much of today's research and decision making is based on the de facto acceptance that knowledge and insight can be gained from analyzing and contextualizing the vast (and growing) amount of “open” or “raw” data. The idea that the many data sources available today enable analyses over combinations of heterogeneous information that would not be achievable with “siloed” data maintained in warehouses is very powerful. The term data lake has been coined to convey the concept of a centralized repository containing virtually inexhaustible amounts of raw (or minimally curated) data that is readily available at any time to anyone authorized to perform analytical activities. The often unstated premise of a data lake is that it relieves users of data acquisition and maintenance issues, and guarantees fast access to local, accurate, and up-to-date data without incurring the development costs (in time and money) typically associated with structured data warehouses. However appealing this premise, it is our experience, and that of our customers, that in practice “raw” data is logistically difficult to obtain, quite challenging to interpret and describe, and tedious to maintain. Furthermore, these challenges multiply as the number of sources grows, increasing the need to thoroughly describe and curate the data in order to make it consumable. In this paper, we describe some of the challenges inherent in creating, filling, maintaining, and governing a data lake, a set of processes that collectively define data wrangling. We propose that what is really needed is a curated data lake, whose contents have undergone a curation process that enables their use and delivers the promise of ad hoc data accessibility to users beyond the enterprise IT staff.
