A benchmark study of sequence alignment methods for protein clustering

Background

Protein sequence alignment has become a crucial step in many bioinformatics studies over the past decades. Multiple sequence alignment (MSA) and pairwise sequence alignment (PSA) are the two major approaches. Earlier benchmark studies revealed drawbacks of MSA methods on nucleotide sequence alignments. To test whether similar drawbacks also affect protein sequence alignment, we propose a new benchmark framework for protein clustering based on cluster validity. The framework directly reflects the biological ground truth of the application scenarios that rely on sequence alignment. It evaluates alignment quality by how well the biological goal is achieved rather than by comparison at the sequence level only, which avoids the biases introduced by alignment scores or manual alignment templates. Unlike earlier studies, we calculate the cluster validity score from sequence distances instead of from clustering results. This removes the influence of the particular clustering method used and thus makes the results more dependable.

Results

PSA methods performed better than MSA methods on most of the BAliBASE benchmark datasets. Analyses of the 80 re-sampled benchmark datasets, constructed by randomly choosing 90% of each dataset 10 times, showed similar results.

Conclusions

These results confirm that the drawbacks of MSA methods revealed at the nucleotide level also exist in protein sequence alignment and affect the accuracy of the results.
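The idea of scoring cluster validity directly from sequence distances, and of repeating the evaluation on 90% sub-samples, can be sketched as follows. This is a minimal illustration, not the authors' implementation. The abstract does not name the validity index, so the silhouette width is used here only as a familiar stand-in. The function names, the toy distance matrix, and the reference labels are all hypothetical. In practice the distance matrix would be derived from PSA or MSA output and the labels from the benchmark's reference families.

```python
import random


def silhouette_from_distances(dist, labels):
    """Mean silhouette width from a precomputed distance matrix and reference labels.

    No clustering algorithm is run: the reference groups play the role of
    clusters, so the score depends only on the alignment-derived distances.
    """
    groups = {}
    for i, lab in enumerate(labels):
        groups.setdefault(lab, []).append(i)
    if len(groups) < 2:
        raise ValueError("need at least two reference groups")

    total = 0.0
    for i, lab in enumerate(labels):
        own = groups[lab]
        if len(own) == 1:
            continue  # a singleton contributes a silhouette of 0
        # a: mean distance to the other members of the same group
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other group
        b = min(
            sum(dist[i][j] for j in members) / len(members)
            for other, members in groups.items()
            if other != lab
        )
        denom = max(a, b)
        total += (b - a) / denom if denom > 0 else 0.0
    return total / len(labels)


def resampled_scores(dist, labels, fraction=0.9, repeats=10, seed=0):
    """Score `repeats` random sub-samples, each keeping `fraction` of the sequences."""
    rng = random.Random(seed)
    n = len(labels)
    k = max(2, round(fraction * n))
    scores = []
    for _ in range(repeats):
        idx = sorted(rng.sample(range(n), k))
        sub_dist = [[dist[i][j] for j in idx] for i in idx]
        sub_labels = [labels[i] for i in idx]
        scores.append(silhouette_from_distances(sub_dist, sub_labels))
    return scores


def toy_matrix(labels, within, between):
    """Hypothetical distances: `within` inside a family, `between` across families."""
    return [
        [0.0 if i == j else (within if a == b else between)
         for j, b in enumerate(labels)]
        for i, a in enumerate(labels)
    ]


# Two invented "alignment methods" evaluated on the same reference families.
labels = ["family_A"] * 5 + ["family_B"] * 5
method_1 = toy_matrix(labels, within=0.1, between=0.9)  # families well separated
method_2 = toy_matrix(labels, within=0.5, between=0.6)  # families barely separated

score_1 = silhouette_from_distances(method_1, labels)
score_2 = silhouette_from_distances(method_2, labels)
resampled_1 = resampled_scores(method_1, labels)
```

Under this sketch, the method whose distances separate the reference families more cleanly receives the higher score, and the spread of `resampled_1` gives a rough sense of how stable that score is under sub-sampling.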
