Proteogenomic Analysis of Polymorphisms and Gene Annotation Divergences in Prokaryotes using a Clustered Mass Spectrometry-Friendly Database

Precise annotation of genes and open reading frames remains difficult, and annotations often diverge even when they are derived from the same genomic sequence. This divergence affects downstream proteomic studies and compromises the characterization of clinical isolates, whose strain-specific genetic variations may not be represented in the selected database. We recently developed software called multistrain mass spectrometry prokaryotic database builder (MSMSpdbb), which merges protein databases from several sources in a proteomics-friendly format and can be applied to any prokaryotic organism. We generated a database for the Mycobacterium tuberculosis complex (using three strains of Mycobacterium bovis and five of M. tuberculosis) and analyzed data collected from two laboratory strains and two clinical isolates of M. tuberculosis. We identified 2561 proteins, of which 24 were present in M. tuberculosis H37Rv samples but not annotated in the M. tuberculosis H37Rv genome. We also identified 280 nonsynonymous single amino acid polymorphisms and confirmed 367 translational start sites. As a proof of concept, we applied the database to whole-genome DNA sequencing data from one of the clinical isolates, which allowed the validation of 116 predicted single amino acid polymorphisms and the annotation of 131 N-terminal start sites. Moreover, we identified regions not present in the original M. tuberculosis H37Rv sequence, indicating either strain divergence or errors in the reference sequence. In conclusion, we demonstrated the potential of a merged database to better characterize laboratory or clinical bacterial strains.
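To make the idea of a merged, mass spectrometry-friendly database concrete, the following is a minimal Python sketch. It is not the MSMSpdbb implementation: the function names, the toy sequences, and the simplified trypsin rule (cleave after K or R unless the next residue is P, no missed cleavages) are all assumptions made for illustration. It digests the same protein from two hypothetical strains and records which strains support each peptide, so that a peptide carrying a single amino acid polymorphism appears as strain-specific.

```python
def tryptic_peptides(sequence, min_length=6):
    """Digest a protein in silico with a simplified trypsin rule.

    Assumption: cleave after K or R unless followed by P, with no
    missed cleavages. Peptides shorter than min_length are dropped.
    """
    peptides = []
    start = 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])  # C-terminal remainder
    return [p for p in peptides if len(p) >= min_length]


def merge_strains(strain_sequences):
    """Map each peptide to the set of strains whose sequence yields it."""
    peptide_to_strains = {}
    for strain, sequence in strain_sequences.items():
        for peptide in tryptic_peptides(sequence):
            peptide_to_strains.setdefault(peptide, set()).add(strain)
    return peptide_to_strains


# Hypothetical toy sequences differing by one residue (T -> S).
strains = {
    "strain_A": "MTEQLAKAVDPRSGLTYEIKNNWQFR",
    "strain_B": "MTEQLAKAVDPRSGLSYEIKNNWQFR",
}

merged = merge_strains(strains)
for peptide, supporting in sorted(merged.items()):
    label = "shared" if len(supporting) == len(strains) else "strain-specific"
    print(peptide, sorted(supporting), label)
```

In this sketch, shared peptides are stored once, while the two variant peptides remain as separate entries. An identification of either variant peptide in a sample would therefore point to the corresponding polymorphism.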
