Processing Shotgun Proteomics Data on the Amazon Cloud with the Trans-Proteomic Pipeline*

Cloud computing, where scalable, on-demand compute cycles and storage are available as a service, has the potential to accelerate mass spectrometry-based proteomics research by providing simple, expandable, and affordable large-scale computing to all laboratories regardless of location or information technology expertise. We present new cloud computing functionality for the Trans-Proteomic Pipeline, a free and open-source suite of tools for the processing and analysis of tandem mass spectrometry datasets. Enabled with Amazon Web Services cloud computing, the Trans-Proteomic Pipeline now accesses large scale computing resources, limited only by the available Amazon Web Services infrastructure, for all users. The Trans-Proteomic Pipeline runs in an environment fully hosted on Amazon Web Services, where all software and data reside on cloud resources to tackle large search studies. In addition, it can also be run on a local computer with computationally intensive tasks launched onto the Amazon Elastic Compute Cloud service to greatly decrease analysis times. We describe the new Trans-Proteomic Pipeline cloud service components, compare the relative performance and costs of various Elastic Compute Cloud service instance types, and present on-line tutorials that enable users to learn how to deploy cloud computing technology rapidly with the Trans-Proteomic Pipeline. We provide tools for estimating the necessary computing resources and costs given the scale of a job and demonstrate the use of cloud enabled Trans-Proteomic Pipeline by performing over 1100 tandem mass spectrometry files through four proteomic search engines in 9 h and at a very low cost.

[1]  Richard Durbin,et al.  Sequence analysis Fast and accurate short read alignment with Burrows – Wheeler transform , 2009 .

[2]  S. Bryant,et al.  Open mass spectrometry search algorithm. , 2004, Journal of proteome research.

[3]  Lior Pachter,et al.  Sequence Analysis , 2020, Definitions.

[4]  Hamid Mirzaei,et al.  Cloud CPFP: a shotgun proteomics data analysis pipeline using cloud and high performance computing. , 2012, Journal of proteome research.

[5]  Lennart Martens,et al.  ProteoCloud: a full-featured open source proteomics cloud computing pipeline. , 2013, Journal of proteomics.

[6]  Cole Trapnell,et al.  Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. , 2010, Nature biotechnology.

[7]  William Stafford Noble,et al.  Semi-supervised learning for peptide identification from shotgun proteomics datasets , 2007, Nature Methods.

[8]  Antti Ylä-Jääski,et al.  Exploiting Hardware Heterogeneity within the Same Instance Type of Amazon EC2 , 2012, HotCloud.

[9]  Natalie I. Tasman,et al.  A guided tour of the Trans‐Proteomic Pipeline , 2010, Proteomics.

[10]  Tom White,et al.  Hadoop: The Definitive Guide , 2009 .

[11]  Brian D Halligan,et al.  Low cost, scalable proteomics data analysis using Amazon's cloud computing services and open source search algorithms. , 2009, Journal of proteome research.

[12]  Robert Burke,et al.  ProteoWizard: open source software for rapid proteomics tools development , 2008, Bioinform..

[13]  Henning Hermjakob,et al.  Hydra: a scalable proteomic search engine which utilizes the Hadoop distributed computing framework , 2012, BMC Bioinformatics.

[14]  Lennart Martens,et al.  mzML—a Community Standard for Mass Spectrometry Data* , 2010, Molecular & Cellular Proteomics.

[15]  D. Tabb,et al.  MyriMatch: highly accurate tandem mass spectral peptide identification by multivariate hypergeometric analysis. , 2007, Journal of proteome research.

[16]  Robertson Craig,et al.  TANDEM: matching proteins with tandem mass spectra. , 2004, Bioinformatics.

[17]  Natalie I. Tasman,et al.  iProphet: Multi-level Integrative Analysis of Shotgun Proteomic Data Improves Peptide and Protein Identification Rates and Error Estimates* , 2011, Molecular & Cellular Proteomics.

[18]  P. Pevzner,et al.  InsPecT: identification of posttranslationally modified peptides from tandem mass spectra. , 2005, Analytical chemistry.

[19]  Henry Living,et al.  Review of: Seibold, Chris Mac OS X Snow Leopard. Pocket guide Sebastopol, CA: O'Reilly Media, Inc., 2009 , 2009, Inf. Res..

[20]  Yassene Mohammed,et al.  Cloud parallel processing of tandem mass spectrometry based proteomics data. , 2012, Journal of proteome research.

[21]  R. Aebersold,et al.  A uniform proteomics MS/MS analysis platform utilizing open XML file formats , 2005, Molecular systems biology.

[22]  Henry H. N. Lam,et al.  Data analysis and bioinformatics tools for tandem mass spectrometry in proteomics. , 2008, Physiological genomics.

[23]  Richard D. Smith,et al.  Recommendations for mass spectrometry data quality metrics for open access data (corollary to the Amsterdam principles) , 2012, Proteomics.

[24]  Chris F. Taylor,et al.  A common open representation of mass spectrometry data and its application to proteomics research , 2004, Nature Biotechnology.

[25]  J. Jeffry Howbert,et al.  MR-Tandem: parallel X!Tandem using Hadoop MapReduce on Amazon Web Services , 2012, Bioinform..

[26]  Jacob D. Jaffe,et al.  Proteogenomic mapping as a complementary method to perform genome annotation , 2004, Proteomics.

[27]  Michael D. Litton,et al.  IDPicker 2.0: Improved protein assembly with high discrimination peptide identification filtering. , 2009, Journal of proteome research.

[28]  Peter J. Tonellato,et al.  Biomedical Cloud Computing With Amazon Web Services , 2011, PLoS Comput. Biol..

[29]  J. Eng,et al.  Comet: An open‐source MS/MS sequence database search tool , 2013, Proteomics.

[30]  Alexey I Nesvizhskii,et al.  Analysis and validation of proteomic data generated by tandem mass spectrometry , 2007, Nature Methods.