TomusBlobs: scalable data‐intensive processing on Azure clouds

The emergence of cloud computing has opened large‐scale compute infrastructures to an ever broader spectrum of applications and users. Although the cloud paradigm is attractive for its ‘elasticity’ in resource usage and associated costs (users pay only for the resources they actually use), cloud applications still suffer from the high latencies and low performance of cloud storage services. As Big Data analysis on clouds becomes increasingly relevant in many application areas, enabling high‐throughput processing of massive data stored in the cloud becomes critical, because it affects overall application performance. In this paper, we address this challenge at the level of cloud storage. We introduce a concurrency‐optimized data storage system, called TomusBlobs, which federates the virtual disks associated with the Virtual Machines running the application code on the cloud. We demonstrate the performance benefits of our solution for efficient data‐intensive processing by building an optimized prototype MapReduce framework for Microsoft's Azure cloud platform on top of TomusBlobs. Finally, we specifically address the limitations of state‐of‐the‐art MapReduce frameworks for reduce‐intensive workloads by proposing MapIterativeReduce, an extension of the MapReduce model. We validate these contributions through large‐scale experiments with synthetic benchmarks and real‐world applications on the Azure commercial cloud, using resources distributed across multiple data centers. The results demonstrate that our solutions bring substantial benefits to data‐intensive applications compared with approaches relying on state‐of‐the‐art cloud object storage. Copyright © 2013 John Wiley & Sons, Ltd.
