MapReduce in the Clouds for Science

The utility computing model introduced by cloud computing, combined with the rich set of cloud infrastructure services, offers a viable alternative to traditional servers and computing clusters. The MapReduce distributed data processing architecture has become the preferred choice for data-intensive analyses in the clouds and in commodity clusters because of its excellent fault tolerance, scalability, and ease of use. Currently, there are several options for using MapReduce in cloud environments, such as using MapReduce as a service, setting up one's own MapReduce cluster on cloud instances, or using specialized cloud MapReduce runtimes that take advantage of cloud infrastructure services. In this paper, we introduce Azure MapReduce, a novel MapReduce runtime built using the Microsoft Azure cloud infrastructure services. The Azure MapReduce architecture successfully leverages the high-latency, eventually consistent, yet highly scalable Azure infrastructure services to provide an efficient, on-demand alternative to traditional MapReduce clusters. Further, we evaluate the use and performance of MapReduce frameworks, including Azure MapReduce, in cloud environments for scientific applications, using sequence assembly and sequence alignment as use cases.
