Large scale multiple sequence alignment with simultaneous phylogeny inference

Multiple sequence alignment (MSA) and phylogenetic tree reconstruction are one of the most important problems in the computational biology. While both these problems are of great practical significance, in most cases they are very computationally demanding. In this paper we propose a new approach to the MSA problem which simultaneously infers an underlying phylogenetic tree. To process large data sets we provide parallel implementation of our method, which is based on the distributed caching of intermediate results. Finally, we show a parallel server designed for grid environments, and we report results of experiments performed with actual biological data, e.g. 1000 ribosomal RNA sequences.

[1]  Azer Bestavros,et al.  Popularity-aware greedy dual-size Web proxy caching algorithms , 2000, Proceedings 20th IEEE International Conference on Distributed Computing Systems.

[2]  CONSTANTINE D. POLYCHRONOPOULOS,et al.  Guided Self-Scheduling: A Practical Scheduling Scheme for Parallel Supercomputers , 1987, IEEE Transactions on Computers.

[3]  M. Kimura,et al.  The neutral theory of molecular evolution. , 1983, Scientific American.

[4]  Cédric Notredame,et al.  3DCoffee: combining protein sequences and structures within multiple sequence alignments. , 2004, Journal of molecular biology.

[5]  Eric Stahlberg,et al.  A component-based implementation of multiple sequence alignment , 2003, SAC '03.

[6]  Yves Van de Peer,et al.  The European database on small subunit ribosomal RNA , 2002, Nucleic Acids Res..

[7]  Mark Handley,et al.  A scalable content-addressable network , 2001, SIGCOMM 2001.

[8]  Michael Kaufmann,et al.  DIALIGN P: Fast pair-wise and multiple sequence alignment using parallel processors , 2004, BMC Bioinformatics.

[9]  M. Holder,et al.  Phylogeny estimation: traditional and Bayesian approaches , 2003, Nature Reviews Genetics.

[10]  Jiye Shi,et al.  HOMSTRAD: adding sequence information to structure-based alignments of homologous protein families , 2001, Bioinform..

[11]  Jonas S. Almeida,et al.  Alignment-free sequence comparison-a review , 2003, Bioinform..

[12]  Rodrigo Lopez,et al.  Multiple sequence alignment with the Clustal series of programs , 2003, Nucleic Acids Res..

[13]  Kuo-Bin Li,et al.  ClustalW-MPI: ClustalW analysis using distributed and parallel computing , 2003, Bioinform..

[14]  Liisa Holm,et al.  COFFEE: an objective function for multiple sequence alignments , 1998, Bioinform..

[15]  Burkhard Rost,et al.  PHD - an automatic mail server for protein secondary structure prediction , 1994, Comput. Appl. Biosci..

[16]  E. Myers,et al.  Basic local alignment search tool. , 1990, Journal of molecular biology.

[17]  M. Gouy,et al.  HOVERGEN: a database of homologous vertebrate genes. , 1994, Nucleic acids research.

[18]  Tao Jiang,et al.  Aligning sequences via an evolutionary tree: complexity and approximation , 1994, STOC '94.

[19]  Jaap Heringa,et al.  Parallelized multiple alignment , 2002, Bioinform..

[20]  Amitava Datta,et al.  Multiple sequence alignment in parallel on a workstation cluster , 2004, Bioinform..

[21]  N. Saitou,et al.  The neighbor-joining method: a new method for reconstructing phylogenetic trees. , 1987, Molecular biology and evolution.

[22]  D. Haussler,et al.  Hidden Markov models in computational biology. Applications to protein modeling. , 1993, Journal of molecular biology.

[23]  James R. Cole,et al.  The Ribosomal Database Project (RDP-II): previewing a new autoaligner that allows regular updates and the new prokaryotic taxonomy , 2003, Nucleic Acids Res..

[24]  D. Sankoff Minimal Mutation Trees of Sequences , 1975 .