A greedy, graph-based algorithm for the alignment of multiple homologous gene lists

MOTIVATION Many comparative genomics studies rely on the correct identification of homologous genomic regions using accurate alignment tools. In such cases, the alphabet of the input sequences consists of complete genes, rather than nucleotides or amino acids. As optimal multiple sequence alignment is computationally impractical, a progressive alignment strategy is often employed. However, such an approach is susceptible to the propagation of alignment errors made in early pairwise alignment steps, especially when dealing with strongly diverged genomic regions. In this article, we present a novel, accurate and efficient greedy graph-based algorithm for the alignment of multiple homologous genomic segments, represented as ordered gene lists.

RESULTS Based on provable properties of the graph structure, several heuristics are developed to resolve local alignment conflicts that occur due to gene duplication and/or rearrangement events on the different genomic segments. The performance of the algorithm is assessed by comparing the alignment results of homologous genomic segments in Arabidopsis thaliana to those obtained by using both a progressive alignment method and an earlier graph-based implementation. Especially for datasets that contain strongly diverged segments, the proposed method achieves a substantially higher alignment accuracy, and proves to be sufficiently fast for large datasets including a few dozen eukaryotic genomes.

AVAILABILITY http://bioinformatics.psb.ugent.be/software. The algorithm is implemented as part of the i-ADHoRe 3.0 package.
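To make the notions of "ordered gene lists" and "alignment conflicts" concrete, the following Python sketch may help. It is an illustration under stated assumptions, not the algorithm or the heuristics implemented in i-ADHoRe 3.0: segments are assumed to be lists of gene-family labels, two genes are assumed to be homologous when their labels match, and candidate links are processed greedily in plain enumeration order (a placeholder for any real scoring scheme). A link is accepted only if the resulting alignment columns can still be ordered consistently with every gene list; the function names `find`, `is_consistent` and `greedy_align` are hypothetical.

```python
from itertools import combinations


def find(parent, x):
    """Union-find lookup with path halving; returns the column representative."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def is_consistent(segments, parent):
    """Return True if the columns admit an order compatible with all gene lists.

    Genes are contracted to their column; consecutive genes on a segment add a
    directed edge. The alignment is conflict-free iff this graph is acyclic.
    """
    succ, indeg = {}, {}
    for s, genes in enumerate(segments):
        for i in range(len(genes)):
            c = find(parent, (s, i))
            succ.setdefault(c, set())
            indeg.setdefault(c, 0)
    for s, genes in enumerate(segments):
        for i in range(len(genes) - 1):
            a, b = find(parent, (s, i)), find(parent, (s, i + 1))
            if a == b:  # two genes of one segment in the same column
                return False
            if b not in succ[a]:
                succ[a].add(b)
                indeg[b] += 1
    # Kahn's algorithm: every column is removed iff there is no cycle.
    stack = [c for c, d in indeg.items() if d == 0]
    seen = 0
    while stack:
        c = stack.pop()
        seen += 1
        for n in succ[c]:
            indeg[n] -= 1
            if indeg[n] == 0:
                stack.append(n)
    return seen == len(indeg)


def greedy_align(segments):
    """Greedily accept homology links that keep the alignment conflict-free."""
    parent = {(s, i): (s, i)
              for s, genes in enumerate(segments) for i in range(len(genes))}
    # Assumption: equal labels mean homologous genes; order is a placeholder.
    links = [((s, i), (t, j))
             for (s, gs), (t, gt) in combinations(enumerate(segments), 2)
             for i, a in enumerate(gs) for j, b in enumerate(gt) if a == b]
    for u, v in links:
        ru, rv = find(parent, u), find(parent, v)
        if ru == rv:
            continue
        trial = dict(parent)
        trial[ru] = rv  # tentatively merge the two columns
        if is_consistent(segments, trial):
            parent = trial
    columns = {}
    for node in parent:
        columns.setdefault(find(parent, node), []).append(node)
    return sorted(sorted(c) for c in columns.values() if len(c) > 1)


# Genes B and C are swapped on the second segment, so only one of the two
# links can be kept next to the A-A link.
print(greedy_align([["A", "B", "C"], ["A", "C", "B"]]))
# prints [[(0, 0), (1, 0)], [(0, 1), (1, 2)]]
```

In this toy input the rearrangement of B and C creates a cycle once both links are added, so the second of the two is rejected. Which link to discard is exactly the kind of decision the heuristics described in the abstract are meant to make; the enumeration order used here is arbitrary.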
