A structured hardware software architecture for peptide based diagnosis — Sub-string matching problem with limited tolerance

The problem of inferring proteins from complex peptide samples in a shotgun proteomic workflow places extreme demands on computational resources: it requires very high processing throughput, rapid processing rates and reliable results. This is exacerbated by the fact that, in general, a given protein cannot be defined by a fixed sequence of amino acids, owing to the existence of splice variants and isoforms of that protein. Protein inference can therefore be treated as the problem of identifying sequences of amino acids with some limited tolerance. Two difficulties arise from this: (a) because of these permitted variations, the applicability of exact string matching methodologies is questionable, and (b) it is difficult to define a reference (peptide/amino acid) sequence for a set of proteins that are functionally indistinguishable but vary in some features. This paper presents a model-based hardware acceleration of a structured and practical inference approach, developed and validated to solve the inference problem for a mass spectrometry experiment of realistic size. Our approach starts from an examination of the known set of splice variants and isoforms of a target protein to identify the Greatest Common Stable Substring (GCSS) of amino acids and the Substrings Subject to Limited Variation (SSLV), together with their respective locations on the GCSS. The hypothesis made here is that the SSLV appear inside complete peptides and do not cut across peptide boundaries. We then define and solve the Sub-string Matching Problem with Limited Tolerance (SMPLT) using the Bit-Split Aho-Corasick Algorithm with Limited Tolerance (BSACLT), which we define and automate. This approach is validated on identified peptides in a labelled and clustered data set from UniProt. A model-based hardware-software co-design strategy is used to accelerate the computational workflow of the protein inference problem described above.
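As a rough illustration of the first step, the sketch below finds the longest substring shared by every known variant of a target protein. It assumes that the GCSS can be approximated by the longest common substring of the variant set, which the abstract does not state; the function name and the toy sequences are hypothetical and are not taken from the paper.

```python
def longest_common_substring(variants):
    """Return the longest substring present in every sequence of `variants`.

    Assumption: this stands in for the GCSS; the paper's exact
    definition may differ.
    """
    if not variants:
        return ""
    # Any common substring must occur in the shortest sequence,
    # so candidates are drawn from it.
    shortest = min(variants, key=len)
    # Try candidate lengths from longest to shortest; the first hit wins.
    for length in range(len(shortest), 0, -1):
        for start in range(len(shortest) - length + 1):
            candidate = shortest[start:start + length]
            if all(candidate in v for v in variants):
                return candidate
    return ""


# Hypothetical toy variants of one protein (not real UniProt entries).
variants = [
    "MKTAYIAKQRQISFVK",
    "MKTAYIAKQRQLSFVK",  # single substitution I -> L
    "GSMKTAYIAKQRQISF",  # shifted by an N-terminal extension
]
stable_core = longest_common_substring(variants)
print(stable_core)
```

This brute-force search is adequate for a handful of short sequences. A realistic variant set would more plausibly call for a suffix-automaton or suffix-array approach, and the regions outside the stable core would then be the candidates for SSLV.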
Identification of Baylisascaris procyonis infection was used as an application instance. The workflow can be generalised to any inexact multiple pattern matching application by replacing the patterns, in a clustered and distributed environment that permits a distance between member strings to account for permitted deviations such as substitutions, insertions and deletions. The co-designed workflow achieved a speed-up of up to 70 times compared with a similar workflow run purely on the processor used for the co-design.
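To make the notion of matching with limited tolerance concrete, the following software-only sketch reports where each pattern occurs in a text within a bounded edit distance (substitutions, insertions and deletions). It is not the BSACLT algorithm: it swaps in a plain per-pattern dynamic programme (Sellers' approximate substring matching) as a stand-in, and all names and sequences are hypothetical.

```python
def approx_find(pattern, text, max_dist):
    """Return (end_index, distance) pairs for substrings of `text` that
    match `pattern` within `max_dist` edits; end_index is exclusive."""
    m = len(pattern)
    # prev[i]: best edit distance between pattern[:i] and any substring
    # of the text ending at the previous text position.
    prev = list(range(m + 1))
    hits = []
    for j, ch in enumerate(text, start=1):
        cur = [0] * (m + 1)  # cur[0] == 0: a match may start anywhere
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == ch else 1
            cur[i] = min(
                prev[i - 1] + cost,  # match or substitution
                prev[i] + 1,         # extra character in the text
                cur[i - 1] + 1,      # pattern character missing from text
            )
        if cur[m] <= max_dist:
            hits.append((j, cur[m]))
        prev = cur
    return hits


def match_all(patterns, text, max_dist):
    """Map each pattern that matches within tolerance to its hit list."""
    results = {}
    for p in patterns:
        hits = approx_find(p, text, max_dist)
        if hits:
            results[p] = hits
    return results


# Hypothetical peptide patterns and sample sequence.
patterns = ["MKTAYIAKQ", "WWWWWWWWW"]
sample = "AAAMKTAYLAKQAAA"  # contains the first pattern with I -> L
found = match_all(patterns, sample, max_dist=1)
print(found)
```

Each pattern is scanned independently here, so the cost grows with the number of patterns. An Aho-Corasick style automaton, as named in the abstract, instead processes all patterns in a single pass over the text, which is the property that makes it attractive for hardware acceleration.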
