Modular and configurable optimal sequence alignment software: Cola

Background: The fundamental challenge in optimally aligning homologous sequences is to define a scoring scheme that best reflects the underlying biological processes. Maximising the overall number of matches in the alignment does not always reflect the patterns by which nucleotides mutate. Efficiently implemented algorithms that can be parameterised to accommodate more complex, nonlinear scoring schemes are thus desirable.

Results: We present Cola, alignment software that implements several optimal alignment algorithms and also allows contiguous matches of nucleotides to be scored in a nonlinear manner. Nonlinear scoring places more emphasis on short, highly conserved motifs and less on the surrounding nucleotides, which can be more diverged. To illustrate the differences, we report results from aligning 14,100 sequences from 3' untranslated regions of human genes to 25 of their mammalian counterparts, where we found that a nonlinear scoring scheme is more consistent than a linear scheme in detecting short, conserved motifs.

Conclusions: Cola is freely available under the LGPL from https://github.com/nedaz/cola.
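The contrast between linear and nonlinear scoring of contiguous matches can be made concrete with a small sketch. The code below is illustrative only and is not Cola's implementation: the function names, the power-law form, and the exponent of 2.0 are assumptions chosen for the example, and mismatch and gap penalties are left out so that only the treatment of match runs differs. It scores an already aligned pair of sequences, first by counting matches and then by raising the length of each uninterrupted run of matches to a power.

```python
def contiguous_match_runs(aligned_a, aligned_b):
    """Return the lengths of maximal runs of matching columns.

    Both inputs are rows of one pairwise alignment, so they must have the
    same length. A column containing a gap ('-') never counts as a match.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    runs = []
    current = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == y and x != "-":
            current += 1
        else:
            if current:
                runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs


def linear_score(aligned_a, aligned_b, match=1.0):
    """Linear scheme: every matching column contributes the same amount."""
    return match * sum(contiguous_match_runs(aligned_a, aligned_b))


def nonlinear_score(aligned_a, aligned_b, exponent=2.0):
    """Hypothetical nonlinear scheme: a run of k matches scores k ** exponent.

    The power-law form and the default exponent are assumptions made for
    this illustration, not parameters taken from Cola.
    """
    return sum(run ** exponent
               for run in contiguous_match_runs(aligned_a, aligned_b))


# Six matches in one conserved block versus six isolated matches.
motif_a, motif_b = "AATAAAGCGC", "AATAAATATA"
scattered_a, scattered_b = "ACACACACACAC", "AGAGAGAGAGAG"

print(linear_score(motif_a, motif_b), linear_score(scattered_a, scattered_b))
print(nonlinear_score(motif_a, motif_b), nonlinear_score(scattered_a, scattered_b))
```

Under the linear scheme the two alignments are indistinguishable, since each has six matches. Under the assumed nonlinear scheme the single six-nucleotide block scores far higher than six scattered matches, which is the sense in which nonlinear scoring favours short, highly conserved motifs over diffuse similarity.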
