Big Data Curation

As data environments grow in variety and volume, organizations need processes and technologies that help them produce and maintain high-quality data and that facilitate data reuse, accessibility, and analysis. In contemporary data management environments, data curation infrastructures play a key role in addressing challenges common to many different data production and consumption environments. Recent changes in the scale of the data landscape bring major changes and new demands to data curation processes and technologies. This chapter investigates how the emerging big data landscape is defining new requirements for data curation infrastructures and how these infrastructures are evolving to meet them. It describes different dimensions of scaling up data curation for big data, including emerging technologies, economic models, incentive models, social aspects, and supporting standards. The analysis is grounded in literature research, interviews with domain experts, surveys, and case studies, and provides an overview of the state of the art, future requirements, and emerging trends in the field.
