CloudBLAST: Combining MapReduce and Virtualization on Distributed Resources for Bioinformatics Applications

This paper proposes and evaluates an approach to the parallelization, deployment, and management of bioinformatics applications that integrates several emerging technologies for distributed computing. The approach uses the MapReduce paradigm to parallelize tools and manage their execution, machine virtualization to encapsulate their execution environments and commonly used data sets into flexibly deployable virtual machines, and network virtualization to connect resources behind firewalls/NATs while preserving the necessary performance and communication environment. An implementation is described and used to demonstrate and evaluate the approach. It integrates Hadoop, Virtual Workspaces, and ViNe as the MapReduce, virtual machine, and virtual network technologies, respectively, to deploy the commonly used bioinformatics tool NCBI BLAST on a WAN-based test bed consisting of clusters at two distinct locations, the University of Florida and the University of Chicago. This WAN-based implementation, called CloudBLAST, was evaluated against both non-virtualized and LAN-based implementations to assess the overheads of machine and network virtualization, which were shown to be insignificant. To compare the approach with an MPI-based solution, CloudBLAST performance was measured against the publicly available mpiBLAST on the same WAN-based test bed. Both versions showed performance gains as the number of available processors increased, with CloudBLAST delivering a speedup of 57 versus 52.4 for the MPI version when 64 processors at two sites were used. The results encourage the use of the proposed approach for executing large-scale bioinformatics applications on emerging distributed environments that provide access to computing resources as a service.
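
As a reading aid, the reported speedups can be converted into parallel efficiency, defined here as speedup divided by processor count. The sketch below is not from the paper: the function name `parallel_efficiency` is illustrative, and it assumes the quoted speedups are measured relative to a single-processor run.

```python
def parallel_efficiency(speedup: float, processors: int) -> float:
    """Return parallel efficiency, i.e. speedup divided by processor count."""
    if processors <= 0:
        raise ValueError("processors must be positive")
    return speedup / processors


# Speedups quoted in the abstract for 64 processors at two sites.
# Assumption: both are relative to a single-processor baseline.
reported = {"CloudBLAST": 57.0, "mpiBLAST": 52.4}

for name, speedup in reported.items():
    efficiency = parallel_efficiency(speedup, 64)
    print(f"{name}: speedup {speedup} on 64 processors -> efficiency {efficiency:.1%}")
# CloudBLAST: speedup 57.0 on 64 processors -> efficiency 89.1%
# mpiBLAST: speedup 52.4 on 64 processors -> efficiency 81.9%
```

Under that assumption, the gap between the two versions at 64 processors is roughly seven percentage points of efficiency.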
