Cumulus: an open source storage cloud for science

Amazon's S3 protocol has emerged as the de facto interface for storage in the commercial data cloud. However, it is closed source and unavailable to the numerous science data centers all over the country. Just as Amazon's Simple Storage Service (S3) provides reliable data cloud access to commercial users, scientific data centers must provide their users with a similar level of service. Ideally, scientific data centers could allow the use of the same clients and protocols that have proven effective for Amazon's users. But how well does the S3 REST interface compare with the data cloud transfer services used in today's computational centers? Does it have the features needed to support the scientific community? If not, can it be extended to include these features without loss of compatibility? Can it scale and distribute resources equally when presented with common scientific usage patterns? We address these questions by presenting Cumulus, an open source implementation of the Amazon S3 REST API. It is packaged with the Nimbus IaaS toolkit and provides scalable and reliable access to scientific data. Its performance compares favorably with that of GridFTP and SCP, and we have added features necessary to support the econometrics important to the scientific community.
