Proteomics Standards Initiative Extended FASTA Format (PEFF)

Mass spectrometry-based proteomics enables the high-throughput identification and quantification of proteins, including sequence variants and post-translational modifications (PTMs), in biological samples. However, most workflows can detect such variations only if they are included in the search space used to analyze the data, and including them remains challenging with most analysis tools. To facilitate the search for known sequence variants and PTMs, the Proteomics Standards Initiative (PSI) has designed and implemented the PSI Extended FASTA Format (PEFF). PEFF is based on the widely used FASTA format but adds a uniform mechanism for encoding substantially more metadata about the sequence collection as well as individual entries, including support for encoding known sequence variants, PTMs, and proteoforms. The format is very nearly backwards compatible: existing FASTA parsers will require little or no change to read PEFF files as FASTA files, although without supporting any of the extra capabilities of PEFF. PEFF is defined by a full specification document, controlled vocabulary terms, a set of example files, software libraries, and a file validator. Popular software and resources are starting to support PEFF, including the sequence search engine Comet and the knowledge bases neXtProt and UniProtKB. Widespread implementation of PEFF is expected to further enable proteogenomics and top-down proteomics applications by providing a standardized mechanism for encoding protein sequences and their known variations. All the related documentation, including the detailed file format specification and example files, is available at http://www.psidev.info/peff.
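To illustrate the backward-compatibility claim, the following Python sketch reads a small PEFF-style text with an ordinary FASTA loop and then separates the per-entry metadata. It rests on assumptions that should be checked against the specification at the URL above: that file-level metadata sits in lines beginning with "#", and that each entry's description line carries annotations as backslash-prefixed Key=Value pairs. The sample database name, accession, sequence, and annotation values are invented for illustration, and the helper names read_fasta_like and split_description are hypothetical rather than part of any PEFF library.

```python
import re
from typing import Dict, Iterator, List, Tuple

# Invented sample: header keys, accession, sequence, and annotation values
# are hypothetical and only mimic the assumed PEFF layout.
SAMPLE = "\n".join([
    "# PEFF 1.0",
    "# DbName=ExampleDb",
    "# //",
    r">ex:P00001 \PName=Example protein \Length=10 \VariantSimple=(3|A)(7|P)",
    "MKTLLVAGGS",
])


def read_fasta_like(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (description, sequence) pairs, ignoring blank and '#' lines.

    Skipping '#' lines is the only step a plain FASTA reader would lack
    under the assumptions stated above.
    """
    description = None
    chunks: List[str] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith(">"):
            if description is not None:
                yield description, "".join(chunks)
            description, chunks = line[1:], []
        else:
            chunks.append(line)
    if description is not None:
        yield description, "".join(chunks)


# A value runs until the next backslash-prefixed key or the end of the line.
KEY_VALUE = re.compile(r"\\(\w+)=(.*?)(?=\s+\\\w+=|$)")


def split_description(description: str) -> Tuple[str, Dict[str, str]]:
    """Split a description line into its identifier and Key=Value annotations."""
    identifier = description.split(None, 1)[0]
    annotations = {key: value.strip() for key, value in KEY_VALUE.findall(description)}
    return identifier, annotations


for description, sequence in read_fasta_like(SAMPLE):
    identifier, annotations = split_description(description)
    print(identifier, sequence, annotations)
```

A reader that stops after read_fasta_like treats the file as plain FASTA; interpreting the annotation values (for example, applying variants to the sequence) is where PEFF-aware tools such as those named in the abstract would go further.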
