"Big Metadata": The Need for Principled Metadata Management in Big Data Ecosystems

Current big data ecosystems lack a principled approach to metadata management. This impedes large organizations' ability to share data as well as data preparation and analysis code, to integrate data, and to ensure that analytic code makes assumptions compatible with the data it uses. This use-case paper describes these challenges and an in-progress effort to address them. We present a real application example, discuss requirements for "big metadata" drawn from that example and from other U.S. government analytic applications, and briefly describe an effort to adapt an existing open-source metadata manager to support the needs of big data ecosystems.
