A Performance Evaluation of X-Ray Crystallography Scientific Workflow Using SciCumulus

X-ray crystallography is an important field because of its role in drug discovery and its relevance to bioinformatics experiments in comparative genomics, phylogenomics, evolutionary analysis, ortholog detection, and three-dimensional structure determination. Managing these experiments is challenging because it requires orchestrating legacy tools and keeping track of several variations of the same experiment. Workflows can model a coherent flow of activities, which is managed by scientific workflow management systems (SWfMS). Because of the large number of workflow variations to be explored (parameters, input data), it is often necessary to execute X-ray crystallography experiments in High Performance Computing (HPC) environments. Cloud computing is well known for its scalable and elastic HPC model. In this paper, we present a performance evaluation of the X-ray crystallography workflow defined by PC4 (Provenance Challenge series). The workflow was executed using the SciCumulus middleware in the Amazon EC2 cloud environment. SciCumulus is a layer for SWfMS that supports the parallel execution of scientific workflows in cloud environments with provenance mechanisms. Our results reinforce the benefits (total execution time × monetary cost) of parallelizing the X-ray crystallography workflow using SciCumulus. The results show a consistent way to execute X-ray crystallography workflows that need HPC using cloud computing. The evaluated workflow shares features with many scientific workflows and can be applied to other experiments.
