Serverless Workflows for Indexing Large Scientific Data

The use and reuse of scientific data ultimately depend on the ability to understand what those data represent, how they were captured, and how they can be used. In many ways, data are only as useful as the metadata available to describe them. Unfortunately, because of growing data volumes, large and distributed collaborations, and a desire to store data for long periods of time, scientific "data lakes" quickly become disorganized and lack the metadata necessary to be useful to researchers. New automated approaches are needed to derive metadata from scientific files and to use these metadata for organization and discovery. Here we describe one such system, Xtract, a service capable of processing vast collections of scientific files and automatically extracting metadata from diverse file types. Xtract relies on function-as-a-service (FaaS) models to enable scalable metadata extraction by orchestrating the execution of many short-running extractor functions. To reduce data transfer costs, Xtract can be configured to deploy extractors either centrally or near the data (i.e., at the edge). We present a prototype implementation of Xtract and demonstrate that it can derive metadata from a 7 TB scientific data repository.
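The orchestration pattern described above, many short-running extractor functions dispatched by file type, can be illustrated with a minimal Python sketch. This is not Xtract's implementation or API: the extractor functions, the `EXTRACTORS` registry, and `index_directory` are hypothetical names chosen for illustration, and a local thread pool stands in for the function-as-a-service backend.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
import csv
import json


def extract_text(path: Path) -> dict:
    """Hypothetical extractor: basic statistics for free-text files."""
    text = path.read_text(errors="replace")
    return {"lines": len(text.splitlines()), "words": len(text.split())}


def extract_tabular(path: Path) -> dict:
    """Hypothetical extractor: column names and row count for CSV files."""
    with path.open(newline="") as handle:
        rows = list(csv.reader(handle))
    header = rows[0] if rows else []
    return {"columns": header, "rows": max(len(rows) - 1, 0)}


def extract_json(path: Path) -> dict:
    """Hypothetical extractor: top-level keys of a JSON document."""
    with path.open() as handle:
        doc = json.load(handle)
    return {"top_level_keys": sorted(doc) if isinstance(doc, dict) else []}


# Assumed mapping from file extension to extractor function.
EXTRACTORS = {
    ".txt": extract_text,
    ".csv": extract_tabular,
    ".json": extract_json,
}


def extract_one(path: Path) -> dict:
    """Run the matching extractor on one file and return a metadata record."""
    record = {"path": str(path), "size_bytes": path.stat().st_size}
    extractor = EXTRACTORS.get(path.suffix.lower())
    if extractor is None:
        return record  # unknown type: keep only file-system metadata
    try:
        record[extractor.__name__] = extractor(path)
    except Exception as exc:  # one bad file should not stop the whole crawl
        record["error"] = f"{type(exc).__name__}: {exc}"
    return record


def index_directory(root: str, max_workers: int = 8) -> list:
    """Crawl a directory tree and extract metadata from every file in parallel.

    The thread pool is a local stand-in for remote function invocations.
    """
    files = sorted(p for p in Path(root).rglob("*") if p.is_file())
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(extract_one, files))
```

Each extractor here is small and independent, which is the property that makes this kind of workload a natural fit for short-lived functions. In a deployment like the one the abstract describes, the call to `extract_one` would presumably run on a remote endpoint, either centrally or close to the data, rather than in a local thread.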
