On the Benefits of Transparent Compression for Cost-Effective Cloud Data Storage

Infrastructure-as-a-Service (IaaS) cloud computing has revolutionized the way we think of acquiring computational resources: it allows users to deploy virtual machines (VMs) at large scale and pay only for the resources actually used during the runtime of the VMs. This new model raises new challenges in the design and development of IaaS middleware: excessive storage costs associated with both user data and VM images might make the cloud less attractive, especially for users who need to manipulate huge data sets and a large number of VM images. Storage costs result not only from storage space utilization, but also from bandwidth consumption: typical deployments perform a large number of data transfers between the VMs and the persistent storage, all under high performance requirements. This paper evaluates the trade-off that results from transparently applying data compression to conserve storage space and bandwidth at the cost of a slight computational overhead. We aim to reduce storage space and bandwidth needs with minimal impact on data access performance. Our solution builds on BlobSeer, a distributed data management service specifically designed to sustain high throughput for concurrent accesses to huge data sequences that are distributed at large scale. Extensive experiments demonstrate that our approach achieves large reductions (at least 40%) in bandwidth and storage space utilization, while still attaining high performance levels that even surpass the original (no compression) performance levels in several data-intensive scenarios.
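
As a rough illustration of what transparent compression means here, the sketch below compresses each chunk before it is stored or sent and falls back to the raw bytes when compression does not help. It is not BlobSeer's implementation: the one-byte header, the function names (pack_chunk, unpack_chunk, savings) and the choice of zlib at a fast compression level are all assumptions made for this example.

```python
import os
import zlib

# Hypothetical one-byte header recording how the chunk was encoded.
RAW, ZLIB = b"\x00", b"\x01"


def pack_chunk(chunk: bytes, level: int = 1) -> bytes:
    """Compress a chunk, keeping the raw bytes if compression does not shrink it."""
    compressed = zlib.compress(chunk, level)
    if len(compressed) < len(chunk):
        return ZLIB + compressed
    return RAW + chunk


def unpack_chunk(stored: bytes) -> bytes:
    """Invert pack_chunk so that callers only ever see the original bytes."""
    header, payload = stored[:1], stored[1:]
    return zlib.decompress(payload) if header == ZLIB else payload


def savings(chunks) -> float:
    """Fraction of bytes saved across a non-empty collection of chunks."""
    original = sum(len(c) for c in chunks)
    stored = sum(len(pack_chunk(c)) for c in chunks)
    return 1.0 - stored / original


# Illustrative inputs: repetitive text compresses well, random bytes do not.
text_like = b"timestamp=0 status=OK\n" * 4096
random_like = os.urandom(64 * 1024)
```

Because the caller reads and writes plain bytes through pack_chunk and unpack_chunk, the encoding stays invisible to the application. The raw fallback bounds the worst-case space overhead at one byte per chunk for incompressible data, although the CPU time spent attempting compression is still paid.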
