A Framework for Multitasking Data-Intensive Management Services in High Performance Computing Environments

Data management entails a continuum of tasks to develop sustainable and reusable collections throughout their lifecycle. Large collections with complex data formats and structures may require what we define as "multitasking data management," involving a combination of manual and automated iterative tasks. When curators conduct these tasks in a desktop computing environment, the work can be labor-intensive and disruptive to research. While the process can be made much more efficient within a Data-Intensive High Performance Computing (DIC/HPC) infrastructure, implementing generalizable services so that non-expert users can easily run automated workflows remains a challenge. This paper introduces a framework for automating data management activities as data-intensive computing jobs within a multitasking workflow. Using as a case study a set of legacy data from an archaeological collection in need of reorganization, we identified the steps required to re-sort and move approximately 27,000 data files into a structured collection architecture. Because data management workflows differ, and because requirements for job submission on data-intensive HPC resources vary widely, we derived a set of generalizable modules that can serve as a guide for curators and HPC consultants. The framework can accommodate collections with different data types and data management requirements and can be applied by curators trained in HPC usage but without extensive computational expertise. After testing, we implemented the framework as a service on a DIC/HPC cluster.
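To make the kind of automated task the framework wraps concrete, the sketch below re-sorts files from a legacy directory tree into a structured hierarchy using parallel worker processes. It is a minimal illustration only: the paths, the extension-based classification rule, and the worker count are assumptions made for the example, not the paper's actual modules or collection architecture, which would instead apply curator-defined rules and the target system's job submission conventions.

    #!/usr/bin/env python3
    # Hypothetical sketch: re-sort legacy files into a structured collection.
    # This is NOT the paper's implementation; it only illustrates how a file
    # re-sorting task might be expressed as an automated, parallel job.
    # All paths and the classification rule are illustrative assumptions.
    import shutil
    from multiprocessing import Pool
    from pathlib import Path

    LEGACY_ROOT = Path("/scratch/collection/legacy")      # assumed source tree
    SORTED_ROOT = Path("/scratch/collection/structured")  # assumed target tree

    def target_dir(path: Path) -> Path:
        # Assumed rule: group files by extension; a real curation workflow
        # would consult curator-defined metadata instead.
        ext = path.suffix.lower().lstrip(".") or "unknown"
        return SORTED_ROOT / ext

    def move_one(path: Path) -> str:
        # Create the destination directory if needed, then move the file.
        dest = target_dir(path)
        dest.mkdir(parents=True, exist_ok=True)
        shutil.move(str(path), str(dest / path.name))
        return path.name

    if __name__ == "__main__":
        files = [p for p in LEGACY_ROOT.rglob("*") if p.is_file()]
        # Split the I/O-bound moves across worker processes, mirroring how
        # such a task could be distributed across cores in a compute job.
        with Pool(processes=8) as pool:
            for name in pool.imap_unordered(move_one, files):
                pass  # a production job would log each move for provenance

On a cluster, a script of this kind would typically be wrapped in a scheduler batch script and submitted through the site's job submission mechanism; abstracting over such site-specific submission requirements is where generalizable modules of the sort described above would apply.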
