Detecting distant homologies on protozoans metabolic pathways using scientific workflows

Bioinformatics experiments are typically composed of pipelines of programs that manipulate very large quantities of data. One promising way to manage such experiments is through workflow management systems (WfMS). In this work we discuss the WfMS features needed to support genome homology workflows and present some issues relevant to typical genomic experiments. Our evaluation used the Kepler WfMS to manage a real genomic pipeline, named OrthoSearch, originally defined as a Perl script. We present a case study detecting distant homologies in trypanosomatid metabolic pathways. Our results reinforce the benefits of WfMS over scripting languages and point out challenges for WfMS in distributed environments.
