Distributed workflow-driven analysis of large-scale biological data using bioKepler

Next-generation DNA sequencing machines generate very large volumes of sequence data with applications to many scientific challenges, placing unprecedented demands on traditional single-processor bioinformatics algorithms. Technologies such as scientific workflows and data-intensive computing promise new capabilities for rapid analysis of next-generation sequence data. Motivated by this and by our previous experience in bioinformatics and distributed scientific workflows, we are creating a Kepler Scientific Workflow System module, called "bioKepler", that facilitates the development of Kepler workflows for integrated execution of bioinformatics applications in distributed environments. This invited talk discusses the challenges of next-generation sequencing data and explains the approaches taken in bioKepler to support analysis of such data.
