Big Data Bags: A Scalable Packaging Format for Science

The need to describe and exchange large and complex data underlies the vast majority of science conducted today. Such needs arise when downloading data from a repository, moving data between remote locations, exchanging data between collaborators, and publishing data as part of the publication process. Although these scenarios are common, describing and exchanging data is surprisingly difficult, and it becomes more difficult still when datasets are large and span multiple storage locations. To address some of these challenges we proposed the Big Data Bag (BDBag) [1], a data packaging format for representing and describing complex, distributed, and large datasets. In this presentation, we outline the BDBag model and describe three scenarios in which it is currently being used. BDBag is designed to provide a simple and convenient way of defining and describing the contents of a dataset. BDBags extend the BagIt specification [2], which they use to provide basic metadata, to enumerate the contents of a dataset, and to serve as a standard packaging format for exchange. The BagIt specification defines how data files are named hierarchically in a directory structure, requires a manifest listing all data objects together with checksums of their contents, and specifies that bags can be serialized into a ZIP file. One of the most significant advantages of the BagIt specification is that it allows files either to be included in the bag or to be referenced by URL in the fetch.txt file. This allows for the exchange of ‘holey’ bags, which need not contain a copy of every data resource and thus can be significantly smaller than bags that include all files. The BagIt specification provides little guidance on how metadata should be encoded in a bag. To address the need to describe bag contents, we developed the BDBag BagIt profile [3], which specifies the use of the Research Object (RO) framework [4] for encoding metadata.
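The layout of a ‘holey’ bag described above can be illustrated with a minimal sketch. It uses only the Python standard library rather than the BDBag tools, and the function name write_holey_bag, the example URL, and the file names are hypothetical. It assumes the BagIt conventions of a bagit.txt declaration, a data directory for included files, a manifest-sha256.txt with one "checksum path" line per file, and a fetch.txt with one "URL length path" line per remote file.

```python
import hashlib
import tempfile
from pathlib import Path


def write_holey_bag(bag_dir, local_files, remote_files):
    """Write a minimal BagIt-style layout (illustrative sketch only).

    local_files:  {relative path: bytes} stored inside data/.
    remote_files: list of (url, size_in_bytes, sha256_hex, relative path)
                  that are referenced in fetch.txt but NOT copied into the bag.
    """
    bag = Path(bag_dir)
    (bag / "data").mkdir(parents=True, exist_ok=True)

    # Bag declaration, using the tag names defined by BagIt.
    (bag / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )

    manifest_lines = []
    for rel, payload in local_files.items():
        target = bag / "data" / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(payload)
        digest = hashlib.sha256(payload).hexdigest()
        manifest_lines.append(f"{digest}  data/{rel}")

    fetch_lines = []
    for url, size, digest, rel in remote_files:
        # Remote files still appear in the manifest so they can be
        # verified once they have been downloaded.
        manifest_lines.append(f"{digest}  data/{rel}")
        fetch_lines.append(f"{url} {size} data/{rel}")

    (bag / "manifest-sha256.txt").write_text(
        "\n".join(manifest_lines) + "\n", encoding="utf-8"
    )
    (bag / "fetch.txt").write_text(
        "\n".join(fetch_lines) + "\n", encoding="utf-8"
    )
    return bag


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Stand-in bytes for a large remote file; in practice the size and
        # checksum would be supplied by the repository hosting the file.
        remote_payload = b"stand-in for a large remote file"
        bag = write_holey_bag(
            Path(tmp) / "example_bag",
            {"notes.txt": b"small file kept in the bag\n"},
            [(
                "https://example.org/archive/large.bin",  # hypothetical URL
                len(remote_payload),
                hashlib.sha256(remote_payload).hexdigest(),
                "large.bin",
            )],
        )
        print((bag / "fetch.txt").read_text(encoding="utf-8"))
```

Because the remote file is listed only in fetch.txt and the manifest, the bag on disk holds just the small local file, which is what keeps a holey bag compact.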
The RO format allows for the description of resources, attribution, provenance, and structured and unstructured annotations associated with resources in the bag. The BDBag BagIt profile requires that compliant bags include an RO manifest.json file that records, for each resource (i.e., each file in the data directory or each remote file), the name, type, and annotations associated with that resource. Thus, BDBag users are able to specify metadata, and even relationships between resources in the bag, in a simple and structured manner. To simplify the use of BDBags we have created both a command line client and a graphical user interface (GUI). The command line client supports creation, validation, (de)serialization (via a ZIP file), and downloading of remote files. The GUI lets users create and update bags, validate and archive them, and fetch remote files; it is shown in Figure 1. Both the client and the GUI automatically resolve references to remote files and download them via Globus [5], HTTP, or FTP.
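To make the per-resource metadata in the RO manifest.json more concrete, the following sketch assembles a small manifest as a Python dictionary and serializes it to JSON. The field names ("@context", "@id", "aggregates", "uri", "mediatype", "annotations", "about", "content") reflect our reading of the Research Object bundle conventions and are assumptions here; they should be checked against the BDBag BagIt profile [3]. The helper build_ro_manifest, the relative "../data/" paths, and the example files are hypothetical.

```python
import json


def build_ro_manifest(resources, annotations):
    """Assemble an RO-style manifest as a plain dictionary (sketch only).

    resources:   list of (uri, mediatype) pairs, where uri is either a
                 bag-relative path or the URL of a remote file.
    annotations: list of (about_uri, content_uri) pairs linking a resource
                 to a file or URL that describes it.
    """
    return {
        # Context and identifier values are assumed, not taken from the source.
        "@context": ["https://w3id.org/bundle/context"],
        "@id": "../",
        # One entry per resource: its name (uri) and its type (mediatype).
        "aggregates": [
            {"uri": uri, "mediatype": mediatype}
            for uri, mediatype in resources
        ],
        # Annotations attach descriptive content to a named resource.
        "annotations": [
            {"about": about, "content": content}
            for about, content in annotations
        ],
    }


if __name__ == "__main__":
    manifest = build_ro_manifest(
        resources=[
            ("../data/samples.csv", "text/csv"),
            # Hypothetical remote file that would also appear in fetch.txt.
            ("https://example.org/archive/large.bin", "application/octet-stream"),
        ],
        annotations=[
            ("../data/samples.csv", "annotations/samples-description.jsonld"),
        ],
    )
    print(json.dumps(manifest, indent=2))
```

Keeping local and remote resources in the same list mirrors the point made above: a resource is described in the same way whether it is stored in the data directory or referenced remotely.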