Scientific Workflow Management in Proteomics

Data processing in proteomics can be challenging: it requires extensive knowledge of many different software packages, each with its own algorithms, data format requirements, and user interface. In this article, we describe the integration of a number of existing programs and tools into Taverna Workbench, a scientific workflow manager currently being developed in the bioinformatics community. We demonstrate how a workflow manager provides a single, visually clear, and intuitive interface to complex data analysis tasks in proteomics, from raw mass spectrometry data to protein identifications and beyond.
