Cloud computing in the age of data-intensive science

The late Jim Gray¹ described how science is becoming ‘datacentric,’ in the sense that vast quantities of public data are used to create new products. What is the cheapest and most efficient way of producing them? Should the data be uploaded to the software, or should software be built near the data? How can we curate these products and capture their processing histories? We have investigated two aspects of these broad themes.² To examine the cost and efficiency of generating new products, we created new image sets from data uploaded to the Amazon Elastic Compute Cloud (EC2) and compared its performance with that of the Abe cluster at the National Center for Supercomputing Applications. We then explored the use of execution logs to capture the processing history of these products.

For image processing we used the Montage image-mosaic engine,³ which creates a composite from multiple images. We computed all mosaics in three stages. First, the input images were reprojected to distribute the energy in the input pixel pattern on the sky to the output pixel pattern. Next, the sky background radiation in each reprojected image was rectified to a common level across the images. Finally, the reprojected, rectified images were co-added to produce the final mosaic. The output of one process becomes the input to the next, so Montage is a data-driven workflow, or pipeline, application. Because it spends 95% of its time on input/output (I/O) operations, it is described as I/O-bound.

We generated eight-square-degree image mosaics of the Messier 17 star-forming region based on 4GB of Two Micron All Sky Survey (2MASS) images. The workflow contained over 10,000 tasks and produced 8GB of output data. Our goal was to compare the performance of Amazon EC2 and Abe. Because the former uses commodity hardware while the latter operates on high-speed networks, we generated all mosaics on single nodes.

Figure 1. Performance of Montage, Broadband, and Epigenome on various platforms. The legend identifies the processors (see Table 1). Processors designated ‘m1’ and ‘cl1’ are on the Amazon EC2 cloud, while those designated ‘abe’ are installed in the Abe cluster.
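The three-stage structure of the mosaic computation (reproject, rectify the background, co-add) can be illustrated with a toy pipeline. The sketch below is not Montage code and does not use Montage's algorithms: images are one-dimensional lists of pixel values, 'reprojection' is a simple shift onto a shared output grid, and background rectification subtracts each image's median as a stand-in for a real background model. The function names, the `offset` and `pixels` fields, and the sample values are all hypothetical and chosen only for illustration.

```python
from statistics import median


def reproject(image, out_size):
    """Stage 1 (toy): place an image's pixels onto the common output grid."""
    grid = [None] * out_size  # None marks output pixels this image does not cover
    for i, value in enumerate(image["pixels"]):
        j = image["offset"] + i
        if 0 <= j < out_size:
            grid[j] = value
    return grid


def rectify_background(grid):
    """Stage 2 (toy): subtract a constant background level (the image median)."""
    covered = [v for v in grid if v is not None]
    level = median(covered) if covered else 0.0
    return [None if v is None else v - level for v in grid]


def coadd(grids):
    """Stage 3: average the covered pixels at each output position."""
    mosaic = []
    for column in zip(*grids):
        covered = [v for v in column if v is not None]
        mosaic.append(sum(covered) / len(covered) if covered else None)
    return mosaic


def build_mosaic(images, out_size):
    """Chain the stages: the output of each stage is the input to the next."""
    reprojected = [reproject(img, out_size) for img in images]
    rectified = [rectify_background(g) for g in reprojected]
    return coadd(rectified)


# Two overlapping toy images with different background levels (10 and 20).
images = [
    {"offset": 0, "pixels": [10.0, 10.0, 12.0]},
    {"offset": 2, "pixels": [22.0, 20.0, 20.0]},
]
mosaic = build_mosaic(images, out_size=5)
print(mosaic)
```

In this toy case the two images share one output pixel. Once their backgrounds are brought to a common level, the shared source contributes the same value from both images, which is why rectification has to come before co-addition.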

[1] G. Bruce Berriman et al. The Role of Provenance Management in Accelerating the Rate of Astronomical Research, 2010, ArXiv.

[2] G. Bruce Berriman et al. Scientific workflow applications on Amazon EC2, 2010, 2009 5th IEEE International Conference on E-Science Workshops.

[3] Anthony J. G. Hey et al. Jim Gray on eScience: a transformed scientific method, 2009, The Fourth Paradigm.