Systematic Evaluation of Protein Sequence Filtering Algorithms for Proteoform Identification Using Top‐Down Mass Spectrometry

Complex proteoforms contain various primary structural alterations resulting from variations in genes, RNA, and proteins. Top‐down mass spectrometry is commonly used for analyzing complex proteoforms because it provides whole sequence information of the proteoforms. Proteoform identification by top‐down mass spectral database search is a challenging computational problem because the types and/or locations of some alterations in target proteoforms are generally unknown. Although spectral alignment and mass graph alignment algorithms have been proposed for identifying proteoforms with unknown alterations, they are too slow to align millions of spectra against tens of thousands of protein sequences in high‐throughput, proteome‐level analyses. Many software tools in this area therefore combine efficient protein sequence filtering algorithms with spectral alignment algorithms to speed up database search. As a result, the performance of these tools relies heavily on the sensitivity and efficiency of their filtering algorithms. Here, we propose two efficient approximate spectrum‐based filtering algorithms for proteoform identification. We evaluated the performance of the proposed algorithms and four existing ones on simulated and real top‐down mass spectrometry data sets. Experiments showed that the proposed algorithms outperformed the existing ones for complex proteoform identification. In addition, combining the proposed filtering algorithms with mass graph alignment algorithms identified many proteoforms missed by ProSightPC in proteome‐level proteoform analyses.
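The two-stage design described above, a cheap filter followed by an expensive alignment on the surviving candidates, can be illustrated with a minimal Python sketch. This is not one of the algorithms proposed or evaluated here: the filter simply counts spectrum masses that coincide with theoretical prefix masses, which is an assumption chosen for brevity, and it does not tolerate the unknown mass shifts that an approximate spectrum‐based filter has to handle. The function names, the partial residue mass table, the tolerance, and the `align` callback are all hypothetical.

```python
from bisect import bisect_left

# Monoisotopic residue masses (Da) for a few amino acids only;
# a real table covers all 20 standard residues.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "L": 113.08406, "K": 128.09496,
}


def prefix_masses(sequence):
    """Return cumulative residue masses of all proper prefixes (ascending)."""
    masses, total = [], 0.0
    for residue in sequence[:-1]:
        total += RESIDUE_MASS[residue]
        masses.append(total)
    return masses


def count_matches(spectrum, theoretical, tol=0.01):
    """Count spectrum masses within `tol` Da of some mass in sorted `theoretical`."""
    hits = 0
    for mass in spectrum:
        i = bisect_left(theoretical, mass - tol)
        if i < len(theoretical) and theoretical[i] <= mass + tol:
            hits += 1
    return hits


def filter_candidates(spectrum, database, top_k=2):
    """Stage 1 (hypothetical filter): keep the top_k proteins by shared-mass count."""
    scored = [
        (count_matches(spectrum, prefix_masses(seq)), name)
        for name, seq in database.items()
    ]
    # Highest count first; ties broken by protein name for determinism.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [name for _, name in scored[:top_k]]


def search(spectrum, database, align, top_k=2):
    """Stage 2: run the expensive `align` scorer only on filtered candidates."""
    candidates = filter_candidates(spectrum, database, top_k)
    return max(candidates, key=lambda name: align(spectrum, database[name]))
```

In this arrangement the cost of `align` is paid `top_k` times per spectrum instead of once per database entry, which is why the sensitivity of the filter bounds the sensitivity of the whole search: a correct protein dropped in stage 1 can never be recovered in stage 2.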
