Using Virtual Clusters to Decouple Computation and Data Management in High Throughput Analysis Applications

The rapid growth in the throughput-to-cost ratio of experimental data production technologies is generating vast amounts of scientific data, often organized into "large" objects (genomes, bio-images) with complex internal structures. These datasets must frequently be shared among multiple research groups that are interested not only in the final results but also in how they are produced. The practical difficulty of moving terabytes or more of data across the network, together with the need to maintain a clear separation between the software stack and the storage infrastructure, is raising interest in the use of virtual clusters for HPC and data-intensive applications. In this paper we use a MapReduce implementation of an image analysis pipeline used by deep sequencing platforms to analyse different virtual cluster scenarios and their impact on system performance.
