UniRef: comprehensive and non-redundant UniProt reference clusters

MOTIVATION: Redundant protein sequences in biological databases hinder sequence similarity searches and make interpretation of search results difficult. Clustering of protein sequence space based on sequence similarity helps organize all sequences into manageable datasets and reduces sampling bias and overrepresentation of sequences.

RESULTS: The UniRef (UniProt Reference Clusters) provide clustered sets of sequences from the UniProt Knowledgebase (UniProtKB) and selected UniProt Archive records to obtain complete coverage of sequence space at several resolutions while hiding redundant sequences. Currently covering >4 million source sequences, the UniRef100 database combines identical sequences and subfragments from any source organism into a single UniRef entry. UniRef90 and UniRef50 are built by clustering UniRef100 sequences at the 90 or 50% sequence identity levels. UniRef100, UniRef90 and UniRef50 yield a database size reduction of approximately 10, 40 and 70%, respectively, from the source sequence set. The reduced redundancy increases the speed of similarity searches and improves detection of distant relationships. UniRef entries contain summary cluster and membership information, including the sequence of a representative protein, member count and common taxonomy of the cluster, the accession numbers of all the merged entries and links to rich functional annotation in UniProtKB to facilitate biological discovery. UniRef has already been applied to broad research areas ranging from genome annotation to proteomics data analysis.

AVAILABILITY: UniRef is updated biweekly and is available for online search and retrieval at http://www.uniprot.org, as well as for download at ftp://ftp.uniprot.org/pub/databases/uniprot/uniref.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
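
The UniRef100 step described in the abstract (combining identical sequences and subfragments into a single entry) can be illustrated with a toy sketch. This is not the UniRef production procedure: the function names, the dictionary layout, the example accessions and the use of a plain substring test to detect subfragments are all assumptions made for illustration, and the quadratic scan would not scale to millions of sequences.

```python
def merge_identical_and_subfragments(records):
    """Group sequences that are identical to, or contained in, a longer one.

    records: dict mapping a (hypothetical) accession to its amino acid sequence.
    Returns a list of clusters, each with a representative accession, the
    representative sequence and the accessions of all merged members.
    """
    # Visit the longest sequences first, so any sequence that could contain
    # a fragment already heads a cluster when the fragment is reached.
    ordered = sorted(records.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    clusters = []
    for acc, seq in ordered:
        for cluster in clusters:
            # Assumption: "subfragment" is modelled as an exact substring.
            if seq in cluster["sequence"]:
                cluster["members"].append(acc)
                break
        else:
            # No existing representative contains this sequence.
            clusters.append(
                {"representative": acc, "sequence": seq, "members": [acc]}
            )
    return clusters


def size_reduction(n_source, n_clusters):
    """Percentage reduction in entry count relative to the source set."""
    return 100.0 * (1 - n_clusters / n_source)


# Made-up example data: A2 duplicates A1, A3 is a fragment of A1.
example = {
    "A1": "MKTAYIAKQR",
    "A2": "MKTAYIAKQR",
    "A3": "TAYIAK",
    "B1": "MSLLNDQW",
}
clusters = merge_identical_and_subfragments(example)
for c in clusters:
    print(c["representative"], len(c["members"]), c["members"])
print(size_reduction(len(example), len(clusters)))
```

Building UniRef90 or UniRef50 would additionally require pairwise sequence identity computed from alignments, which this sketch deliberately leaves out.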
