FASTA Herder: a web application to trim protein sequence sets

Abstract The ever-increasing number of sequences in protein databases means that sequence similarity searches usually return large numbers of homologs. Although homology information is very useful for functional prediction based on amino acid conservation, many of these homologs share high levels of identity with one another, which hinders the computation and, especially, the visualization of multiple sequence alignments. More generally, high redundancy reduces the usability of a protein set in machine learning applications and biases statistical analyses. We developed an algorithm that identifies redundant sequence homologs, which can then be culled to produce a streamlined FASTA file. Unlike other automatic approaches, which only aggregate sequences with high identity, our method clusters near-full-length homologs, allowing for lower sequence identity thresholds. The method was fully tested and implemented in a web application called FASTA Herder, publicly available at http://fh.ogic.ca/.
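To make the idea of culling redundant, near-full-length homologs concrete, the following is a minimal sketch and not the FASTA Herder algorithm itself, whose details are not given in this abstract. It assumes a greedy scheme in which a sequence is dropped when an already kept representative is both of similar length and sufficiently identical. The function names (`parse_fasta`, `similarity`, `cull`), the thresholds `min_identity` and `min_coverage`, and the use of Python's `difflib.SequenceMatcher` as a crude stand-in for a real pairwise alignment are all illustrative assumptions.

```python
from difflib import SequenceMatcher


def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def similarity(a, b):
    """Return (identity, coverage) for two sequences.

    Assumption: SequenceMatcher's matching blocks approximate aligned
    identical residues; a real tool would use a proper alignment.
    Coverage is approximated here by the length ratio of the two sequences.
    """
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    shorter, longer = min(len(a), len(b)), max(len(a), len(b))
    return matched / shorter, shorter / longer


def cull(records, min_identity=0.6, min_coverage=0.9):
    """Greedily keep one representative per group of redundant homologs.

    Both thresholds are hypothetical defaults chosen for illustration.
    """
    kept = []
    # Longest sequences first, so representatives tend to be full length.
    for header, seq in sorted(records, key=lambda r: -len(r[1])):
        redundant = False
        for _, rep in kept:
            identity, coverage = similarity(seq, rep)
            if coverage >= min_coverage and identity >= min_identity:
                redundant = True
                break
        if not redundant:
            kept.append((header, seq))
    return kept


example = """>seq_a
MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ
>seq_b
MKTAYWAKQRQISFVKSHFSRQLEERLGLIEVQ
>seq_c
MKTAYIAKQR
>seq_d
WWWWCCCCDDDDNNNNPPPPWWWWCCCCDDDDN
"""

kept_headers = [header for header, _ in cull(parse_fasta(example))]
```

In this toy input, `seq_b` differs from `seq_a` by a single residue and is culled, while the short fragment `seq_c` is retained because it fails the coverage requirement, and the unrelated `seq_d` is retained because it fails the identity requirement. Requiring near-full-length coverage is what lets the identity threshold be set lower without merging fragments or partial matches into a cluster.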
