UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches

Motivation: UniRef databases provide full-scale clustering of UniProtKB sequences and are used for a broad range of applications, particularly similarity-based functional annotation. Non-redundancy and intra-cluster homogeneity in UniRef were recently improved by adding a sequence length overlap threshold. We hypothesized that these improvements would enhance the speed and sensitivity of similarity searches and improve the consistency of annotation within clusters.

Results: Intra-cluster molecular function consistency was examined by analysis of Gene Ontology terms. UniRef clusters bring together proteins of identical molecular function in more than 97% of the clusters, implying that clusters are useful for annotation and can also be used to detect annotation inconsistencies. To examine coverage in similarity results, BLASTP searches against UniRef50, followed by expansion of the hit lists with cluster members, were compared with searches against UniProtKB sequences. The UniRef50 searches are more concise (∼7 times shorter hit list before expansion), faster (∼6 times) and more sensitive in detecting remote similarities (>96% recall at e-value <0.0001). Our results support the use of UniRef clusters as a comprehensive and scalable alternative to native sequence databases for similarity searches and reinforce their reliability for use in functional annotation.

Availability and implementation: Web access and file download from the UniProt website at http://www.uniprot.org/uniref and ftp://ftp.uniprot.org/pub/databases/uniprot/uniref. BLAST searches against UniRef are available at http://www.uniprot.org/blast/

Contact: huang@dbi.udel.edu
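The search-then-expand procedure described in the Results can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names, the toy cluster identifiers and accessions, and the choice to measure recall against the hit list of a direct UniProtKB search are all assumptions made for illustration.

```python
from typing import Dict, Iterable, List, Set


def expand_hits(cluster_hits: Iterable[str],
                cluster_members: Dict[str, List[str]]) -> List[str]:
    """Replace each cluster hit by its member accessions, keeping hit order."""
    expanded: List[str] = []
    seen: Set[str] = set()
    for cluster_id in cluster_hits:
        # A cluster absent from the mapping contributes no members.
        for accession in cluster_members.get(cluster_id, []):
            if accession not in seen:
                seen.add(accession)
                expanded.append(accession)
    return expanded


def recall(expanded: Iterable[str], reference_hits: Iterable[str]) -> float:
    """Fraction of reference hits recovered in the expanded hit list."""
    reference = set(reference_hits)
    if not reference:
        return 0.0
    return len(reference & set(expanded)) / len(reference)


# Hypothetical toy data: two clusters hit by a search against the clustered
# database, and five sequences hit by a direct search of the full database.
toy_cluster_members = {
    "UniRef50_A": ["P1", "P2", "P3"],
    "UniRef50_B": ["P4"],
}
toy_cluster_hits = ["UniRef50_A", "UniRef50_B"]
toy_direct_hits = ["P1", "P2", "P3", "P4", "P5"]

toy_expanded = expand_hits(toy_cluster_hits, toy_cluster_members)
toy_recall = recall(toy_expanded, toy_direct_hits)
print(toy_expanded)
print(toy_recall)
```

In this toy case the clustered search returns two hits that expand to four member sequences, which shows how a short hit list can still cover most of what a direct search finds.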
