Kurator: A Kepler Package for Data Curation Workflows

Abstract: Data curation is critical for scientific data digitization, sharing, integration, and use. This paper presents Kurator, a software package for automating data curation pipelines in the Kepler scientific workflow system. Several curation tools and services are integrated into the package as actors, enabling the construction of workflows that perform and document various data curation tasks. The integration of Google cloud services (e.g., Google spreadsheets) allows workflow steps to invoke human experts outside the workflow in a manner that greatly simplifies the complex data handling in distributed, multi-user curation workflows. The Kepler platform provides modeling, execution, and management capabilities for the curation package, including a collection-oriented model of computation (COMAD) and provenance tracking and browsing. These features not only allow workflows to be easily modeled, maintained, and evolved, but also facilitate QA/QC of curation results through examination of the provenance information recorded during workflow execution. The effectiveness of the Kurator package is demonstrated through a workflow for data curation of natural science collections.
