Circular sequence comparison: algorithms and applications

Background: Sequence comparison is a fundamental step in many important tasks in bioinformatics, from phylogenetic reconstruction to the reconstruction of genomes. Traditional algorithms for measuring approximation in sequence comparison are based on the notions of distance or similarity, and are generally computed through sequence alignment techniques. Circular molecular structure is a common phenomenon in nature, but adapting alignment techniques to circular sequence comparison is computationally expensive, requiring from super-quadratic to cubic time in the length of the sequences.

Results: In this paper, we introduce a new distance measure based on q-grams, and show how it can be applied effectively and computed efficiently for circular sequence comparison. Experimental results, using real DNA, RNA, and protein sequences as well as synthetic data, demonstrate that our approach is orders of magnitude more efficient, while maintaining accuracy competitive with the state of the art.
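To make the idea of a q-gram based distance concrete, the following is a minimal Python sketch. It is not the measure introduced in the paper, whose definition is not given in this abstract. It assumes the classical q-gram distance (the sum of absolute differences between q-gram counts) and, as a simplifying assumption, counts q-grams around the circle so that the result does not depend on where each sequence was linearised. The function names are hypothetical.

```python
from collections import Counter


def circular_qgram_profile(s, q):
    """Count the q-grams of s read as a circle (wrapping past the end)."""
    if not 0 < q <= len(s):
        raise ValueError("q must satisfy 0 < q <= len(s)")
    # Append the first q-1 letters so that windows starting near the end wrap around.
    extended = s + s[:q - 1]
    return Counter(extended[i:i + q] for i in range(len(s)))


def qgram_distance(profile_x, profile_y):
    """Sum of absolute count differences over all q-grams in either profile."""
    grams = profile_x.keys() | profile_y.keys()
    return sum(abs(profile_x[g] - profile_y[g]) for g in grams)


def circular_qgram_distance(x, y, q):
    """Illustrative rotation-invariant q-gram distance (an assumption, not the paper's measure)."""
    return qgram_distance(circular_qgram_profile(x, q),
                          circular_qgram_profile(y, q))


if __name__ == "__main__":
    x = "ACGTAC"
    y = x[2:] + x[:2]  # the same circular sequence, cut at a different position
    print(circular_qgram_distance(x, y, 3))
    print(circular_qgram_distance("AAAA", "AAAT", 2))
```

Building each profile takes time linear in the sequence length for fixed q, which is the kind of saving over alignment that motivates q-gram approaches. A profile of this kind discards the order of the q-grams, so two different circular sequences can be at distance zero.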
