A space-efficient algorithm for three sequence alignment and ancestor inference

We propose a novel algorithm that simultaneously aligns three biological sequences under an affine gap model and infers their common ancestral sequence. It applies a divide-and-conquer strategy to reduce memory usage from O(n^3) to O(n^2). Because it is still based on dynamic programming, the optimal alignment is guaranteed. We implemented the algorithm and tested it extensively on both the BAliBASE dataset and simulated data generated by the Random Model of Sequence Evolution (ROSE). Compared with other popular multiple sequence alignment tools such as ClustalW and T-Coffee, our program produces both better alignments and better ancestral sequences.
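To illustrate where the O(n^2) memory bound can come from, here is a minimal sketch of the forward scoring pass that divide-and-conquer alignment schemes typically build on. It is not the paper's implementation: it assumes a simple linear gap cost rather than the affine model, a sum-of-pairs score with hypothetical match/mismatch values, and it returns only the optimal score, with no traceback and no ancestor inference. The names `pair_score`, `column_score`, `three_way_score`, and `GAP` are illustrative.

```python
from itertools import product

GAP = -2  # assumed linear gap penalty per gapped pair (not the paper's affine model)


def pair_score(a, b):
    """Score one aligned pair of column symbols; '-' denotes a gap."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return GAP
    return 1 if a == b else -1  # assumed match/mismatch values


def column_score(a, b, c):
    """Sum-of-pairs score of one alignment column."""
    return pair_score(a, b) + pair_score(a, c) + pair_score(b, c)


# The seven possible moves in the 3D lattice: every non-zero (di, dj, dk) in {0,1}^3.
MOVES = [m for m in product((0, 1), repeat=3) if any(m)]


def three_way_score(x, y, z):
    """Optimal three-sequence score, keeping only two 2D planes in memory."""
    ny, nz = len(y), len(z)
    neg_inf = float("-inf")
    prev = None  # plane for index i - 1
    for i in range(len(x) + 1):
        cur = [[neg_inf] * (nz + 1) for _ in range(ny + 1)]  # plane for index i
        for j in range(ny + 1):
            for k in range(nz + 1):
                if i == 0 and j == 0 and k == 0:
                    cur[j][k] = 0
                    continue
                best = neg_inf
                for di, dj, dk in MOVES:
                    if di > i or dj > j or dk > k:
                        continue  # move would leave the lattice
                    plane = prev if di else cur
                    a = x[i - 1] if di else "-"
                    b = y[j - 1] if dj else "-"
                    c = z[k - 1] if dk else "-"
                    cand = plane[j - dj][k - dk] + column_score(a, b, c)
                    best = max(best, cand)
                cur[j][k] = best
        prev = cur  # the older plane is discarded here
    return prev[ny][nz]
```

Only the planes for `i - 1` and `i` exist at any time, so storage is proportional to `len(y) * len(z)` instead of the full cube. A Hirschberg-style scheme would run such a pass forward and backward to find a midpoint and then recurse on the two halves to recover the alignment itself; that recursion is omitted here.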
