Hydra: a scalable proteomic search engine which utilizes the Hadoop distributed computing framework

Background
For shotgun mass spectrometry-based proteomics, the most computationally expensive step is matching the spectra against an increasingly large database of sequences and their post-translational modifications with known masses. Each mass spectrometer can generate data at a very high rate, and the scope of what is searched for is continually increasing. Solutions that improve our ability to perform these searches are therefore needed.

Results
We present a sequence database search engine that is specifically designed to run efficiently on the Hadoop MapReduce distributed computing framework. The search engine implements the K-score algorithm and generates output comparable to that of the original implementation for the same input files. The scalability of the system is shown, and the architecture required for the development of such distributed processing is discussed.

Conclusion
The software is scalable in its ability to handle a large peptide database, numerous modifications and large numbers of spectra. Performance scales with the number of processors in the cluster, allowing throughput to expand with the available resources.
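To make the MapReduce framing concrete, the following is a minimal in-process sketch of how candidate peptides might be paired with spectra by precursor mass. It is not Hydra's code and does not implement K-score: the function names (`map_peptide`, `map_spectrum`, `reduce_bin`, `search`), the 1 Da bin width, and the 0.5 Da tolerance are all assumptions chosen for illustration, and a plain dictionary stands in for Hadoop's shuffle phase.

```python
from collections import defaultdict

# Monoisotopic residue masses (Da) for the few amino acids used in the toy data.
RESIDUE_MASS = {"G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
                "E": 129.04259, "K": 128.09496}
WATER = 18.01056  # mass of H2O added to the sum of residue masses


def peptide_mass(seq):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in seq) + WATER


def mass_bin(mass, width=1.0):
    """Integer key for the mass bin (hypothetical 1 Da bins)."""
    return int(mass // width)


def map_peptide(seq):
    """Map step: emit each peptide under the key of its own mass bin."""
    mass = peptide_mass(seq)
    yield mass_bin(mass), ("peptide", seq, mass)


def map_spectrum(spec_id, precursor_mass, tol=0.5):
    """Map step: emit a spectrum under every bin its tolerance window touches."""
    low = mass_bin(precursor_mass - tol)
    high = mass_bin(precursor_mass + tol)
    for key in range(low, high + 1):
        yield key, ("spectrum", spec_id, precursor_mass)


def shuffle(pairs):
    """Stand-in for the framework's shuffle: group values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_bin(values, tol=0.5):
    """Reduce step: within one bin, pair spectra with peptides inside tolerance."""
    peptides = [(name, m) for kind, name, m in values if kind == "peptide"]
    spectra = [(name, m) for kind, name, m in values if kind == "spectrum"]
    for spec_id, precursor in spectra:
        for seq, mass in peptides:
            if abs(precursor - mass) <= tol:
                yield spec_id, seq


def search(peptides, spectra, tol=0.5):
    """Run map, shuffle and reduce in one process; return candidates per spectrum."""
    pairs = []
    for seq in peptides:
        pairs.extend(map_peptide(seq))
    for spec_id, precursor in spectra:
        pairs.extend(map_spectrum(spec_id, precursor, tol))
    candidates = defaultdict(set)
    for values in shuffle(pairs).values():
        for spec_id, seq in reduce_bin(values, tol):
            candidates[spec_id].add(seq)
    return {spec_id: sorted(seqs) for spec_id, seqs in candidates.items()}


if __name__ == "__main__":
    toy_peptides = ["GASP", "SPAG", "PEAK"]
    toy_spectra = [("s1", 330.2), ("s2", 443.0), ("s3", 500.0)]
    print(search(toy_peptides, toy_spectra))
```

Because each bin is reduced independently, the reduce calls could in principle run on separate nodes, which is the property that lets throughput grow with cluster size. A real search engine would then score each candidate pair against the fragment peaks, a step this sketch omits.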
