Highly scalable genome assembly on campus grids

Bioinformatics researchers need efficient means to process large collections of sequence data. One application of interest, genome assembly, has great potential for parallelization; however, most previous attempts at parallelization require uncommon high-end hardware. This paper introduces a scalable modular genome assembler that can achieve significant speedup using large numbers of conventional desktop machines, such as those found in a campus computing grid. The system is based on the Celera open-source assembly toolkit and replaces two independent sequential modules with scalable counterparts: a candidate selector that exploits the distributed memory capacity of a campus grid, and an aligner that exploits its distributed computing capacity. For large problems, these modules provide robust task and data management while also achieving speedup with high efficiency on several scales of resources. We show results for several datasets ranging from 738 thousand to over 121 million alignments, using campus grid resources ranging from a small cluster to more than a thousand nodes spanning three institutions. Our largest run so far achieves a 927x speedup with 71.3 percent efficiency.
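The relationship between the reported speedup and efficiency can be sketched with a short calculation. This sketch assumes the conventional definition of parallel efficiency (speedup divided by the number of workers); the abstract does not state which definition is used, and the function names below are illustrative only.

```python
def parallel_efficiency(speedup: float, workers: int) -> float:
    """Efficiency under the conventional definition: speedup / workers."""
    if workers <= 0:
        raise ValueError("workers must be positive")
    return speedup / workers


def implied_workers(speedup: float, efficiency: float) -> float:
    """Worker count implied by a speedup and an efficiency (assumed definition)."""
    if not 0.0 < efficiency <= 1.0:
        raise ValueError("efficiency must be in (0, 1]")
    return speedup / efficiency


# Figures from the abstract: 927x speedup at 71.3 percent efficiency.
workers = implied_workers(927.0, 0.713)
print(f"implied workers: {workers:.0f}")
print(f"efficiency at that count: {parallel_efficiency(927.0, round(workers)):.3f}")
```

Under that assumption, the two figures imply a worker count on the order of 1,300, which is consistent with the "more than a thousand nodes" described for the largest configuration.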
