Using de novo protein structure predictions to measure the quality of very large multiple sequence alignments

Abstract

Motivation: Multiple sequence alignments (MSAs) with large numbers of sequences are now commonplace. However, current multiple alignment benchmarks are ill-suited for testing these types of alignments, as test cases either contain a very small number of sequences or are based purely on simulation rather than empirical data.

Results: We take advantage of recent developments in protein structure prediction methods to create a benchmark (ContTest) for protein MSAs that contains many thousands of sequences in each test case and is based on empirical biological data. We rank popular MSA methods using this benchmark and verify a recent result showing that chained guide trees increase the accuracy of progressive alignment packages on datasets with thousands of proteins.

Availability and implementation: Benchmark data and scripts are available for download at http://www.bioinf.ucd.ie/download/ContTest.tar.gz.

Contact: des.higgins@ucd.ie

Supplementary information: Supplementary data are available at Bioinformatics online.
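For readers unfamiliar with the term used in the Results, a chained guide tree is a fully imbalanced tree in which sequences join the progressive alignment one at a time. The following is a minimal illustrative sketch in Python, not code from the ContTest distribution; the function name `chained_guide_tree`, the sequence names, and the choice of Newick output without branch lengths are all assumptions made for the example.

```python
def chained_guide_tree(names):
    """Return a Newick string for a fully imbalanced (chained) tree.

    Illustrative helper only (hypothetical name): the first two sequences
    form the innermost pair, and each further sequence is attached one
    level higher, so the tree is a single chain.
    """
    if len(names) < 2:
        raise ValueError("need at least two sequence names")
    tree = f"({names[0]},{names[1]})"
    for name in names[2:]:
        # Wrap the tree built so far and attach the next sequence to it.
        tree = f"({tree},{name})"
    return tree + ";"


if __name__ == "__main__":
    # Hypothetical sequence names, in the order they should be aligned.
    print(chained_guide_tree(["seq1", "seq2", "seq3", "seq4"]))
```

Progressive aligners that accept a user-supplied guide tree could in principle be given such a string; how a particular package reads it is outside the scope of this sketch.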
