A proteome quality index.

We present the Proteome Quality Index (PQI; http://pqi-list.org), a much-needed resource for users of bacterial and eukaryotic proteomes. Completely sequenced genomes for which a set of protein sequences (the proteome) is available are given a one- to five-star rating supported by 11 different quality metrics. The database indexes over 3000 proteomes at the time of writing and is provided via a website for browsing, filtering and downloading. Prior to this work, there was no systematic way to account for the large variability in quality among the thousands of available proteomes, and this is likely to have profoundly influenced the outcome of many published studies, in particular large-scale comparative analyses. The lack of a measure of proteome quality is likely due to the difficulty of producing one, a problem that we have approached by integrating multiple metrics. The continued development and improvement of the index will require the contribution of additional metrics by us and by others; the PQI provides a useful point of reference for the scientific community, but it is only the first step towards a 'standard' for the field.
