Data curation with a focus on reuse

A dataset from the field of High Performance Computing (HPC) was curated to facilitate its reuse and to appeal to a broader audience beyond HPC specialists. At an early stage in the research project, the curators gathered requirements from prospective users of the dataset, focusing on how and for which research projects they would reuse the data. User needs determined which curation tasks to conduct: adding information elements to the dataset to expand its scope; removing personal information; and packaging the data in a size, a format, and at a delivery frequency that are convenient for access and analysis. The curation tasks are embedded in the software that produces the data and are implemented as an automated workflow that spans the HPC resources in which the dataset is generated, processed, and stored, and the Texas ScholarWorks institutional repository, through which the data is published. Within this distributed architecture, the integrated data creation and curation workflow complies with long-term preservation requirements, and is the first one implemented as a collaboration between the supercomputing center, where the data is created on an ongoing basis, and the University Libraries at UT Austin, where it is published. The targeted curation strategy included the design of proof-of-concept data analyses to evaluate whether the curated data met the reuse scenarios proposed by users. The results suggest that the dataset is understandable and that researchers can use it to answer some of the research questions they posed. The results also pointed to specific elements of the curation strategy that had to be improved and disclosed the difficulties involved in breaking data to new users.
