Paper information - QOMA2: Optimizing the alignment of many sequences

We consider the problem of aligning multiple protein sequences with the goal of maximizing the SP (sum-of-pairs) score when the number of sequences is large. The QOMA (quasi-optimal multiple alignment) algorithm addressed this problem for a small number of sequences, but it becomes impractical as the number of sequences increases. This paper develops a new algorithm, QOMA2, which optimizes the SP score of the alignment of an arbitrarily large number of sequences. Given an initial (potentially sub-optimal) alignment, QOMA2 selects short subsequences from this alignment by placing a window on it. It quickly estimates the amount of improvement that can be obtained by optimizing the alignment of the subsequences in short windows on this alignment; this estimate is called the SW (sum of weights) score. QOMA2 then employs a dynamic programming algorithm that selects the set of window positions with the largest total expected improvement. It partitions the subsequences within each window into clusters such that the number of subsequences in each cluster is small enough to be optimally aligned within a given time. It also aims to select these clusters so that the optimal alignment of the subsequences in these clusters produces the highest expected SP score. The experimental results show that QOMA2 produces high SP scores quickly, even for a large number of sequences. They also show that the SW score and the resulting SP score are highly correlated, which suggests that optimizing the SW score is a promising strategy, since computing it is much cheaper than aligning multiple sequences optimally. The software and the benchmark data set are available from the authors on request.
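
To make two of the ideas in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a toy scoring scheme (the `match`, `mismatch`, and `gap` values are hypothetical; a protein aligner would normally use a substitution matrix), fixed-width windows, and a precomputed list `gain` of estimated improvements per window start, standing in for the SW score whose exact definition the abstract does not give. The function names `sp_score` and `select_windows` are illustrative only.

```python
from itertools import combinations


def sp_score(alignment, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score of an alignment given as equal-length strings.

    The scoring values are hypothetical placeholders, not taken from the paper.
    """
    total = 0
    for a, b in combinations(alignment, 2):
        for x, y in zip(a, b):
            if x == "-" and y == "-":
                continue  # a gap aligned with a gap contributes nothing
            if x == "-" or y == "-":
                total += gap
            elif x == y:
                total += match
            else:
                total += mismatch
    return total


def select_windows(gain, width):
    """Pick non-overlapping fixed-width windows with the largest total gain.

    gain[i] is an assumed estimate of the improvement obtained by
    re-aligning the window that starts at position i and spans
    `width` positions. Returns (best_total, list_of_window_starts).
    """
    n = len(gain)
    # best[i] = largest total gain achievable using window starts >= i
    best = [0] * (n + width)
    take = [False] * n
    for i in range(n - 1, -1, -1):
        skip = best[i + 1]
        use = gain[i] + best[i + width]
        if use > skip:
            best[i], take[i] = use, True
        else:
            best[i] = skip
    # Trace the decisions forward to recover the chosen window starts.
    starts, i = [], 0
    while i < n:
        if take[i]:
            starts.append(i)
            i += width
        else:
            i += 1
    return best[0], starts


if __name__ == "__main__":
    print(sp_score(["AC-T", "A-GT", "ACGT"]))
    print(select_windows([3, 5, 4, 1, 6], width=2))
```

The window selection here is the standard dynamic program for weighted interval scheduling, specialized to equal-width windows; the paper's own dynamic programming formulation may differ.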

Tamer Kahveci | Xu Zhang
