Managing and Optimizing Bioinformatics Workflows for Data Analysis in Clouds

Recent rapid advances in high-throughput technologies in the life sciences have enabled the generation and storage of huge amounts of data in different databases. Despite significant developments in computing capacity and performance, analysing these large-scale data in search of biomedically relevant patterns remains a challenging task. Scientific workflow applications are meant to support data mining in more complex scenarios that include many data sources and computational tools, as commonly found in bioinformatics. A scientific workflow application is a holistic unit that defines, executes, and manages scientific applications using different software tools. Existing workflow applications are process- or data-oriented rather than resource-oriented. Thus, they lack efficient computational resource management capabilities, such as those provided by Cloud computing environments. Insufficient computational resources disrupt the execution of workflow applications, wasting time and money. To address this issue, advanced resource monitoring and management strategies are required to determine the resource consumption behaviours of workflow applications and to enable dynamic allocation and deallocation of resources. In this paper, we present a novel Cloud management infrastructure consisting of resource-level and application-level monitoring techniques and a knowledge management strategy, which manages computational resources to support workflow application executions in order to guarantee their performance goals and successful completion. We present the design of these techniques, demonstrate how they can be applied to scientific workflow applications, and present detailed evaluation results as a proof of concept.
