Linear-Time Sequence Comparison Using Minimal Absent Words & Applications

Sequence comparison is a prerequisite to virtually all comparative genomic analyses. It is often carried out with sequence alignment techniques, which are computationally expensive. This has led to increased research into alignment-free techniques, which rely on measures of the composition of sequences in terms of their constituent patterns. These measures, such as q-gram distance, can usually be computed in time linear in the length of the sequences. In this article, we focus on the complementary idea: how two sequences can be efficiently compared based on information that does not occur in them. A word is an absent word of a sequence if it does not occur in that sequence. An absent word is minimal if all its proper factors occur in the sequence. Here we present the first linear-time and linear-space algorithm to compare two sequences by considering all their minimal absent words. In the process, we present results of combinatorial interest, and we also extend the proposed techniques to compare circular sequences.
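To make the definition of a minimal absent word concrete, the following is a minimal brute-force sketch. It is not the linear-time algorithm announced above: it enumerates all factors explicitly and so takes at least quadratic time and space. The function names `minimal_absent_words` and `naive_maw_distance` are hypothetical, and the distance used here (the size of the symmetric difference of the two sets of minimal absent words) is an illustrative assumption, not necessarily the measure defined in the article.

```python
def minimal_absent_words(x, alphabet):
    """Return the set of minimal absent words of x over the given alphabet.

    Brute-force illustration only (hypothetical helper, not the article's
    algorithm). A word w is a minimal absent word of x if w does not occur
    in x while its longest proper prefix and longest proper suffix both do.
    """
    n = len(x)
    # All factors (substrings) of x, including the empty word.
    factors = {x[i:j] for i in range(n + 1) for j in range(i, n + 1)}
    maws = set()
    for p in factors:              # p: candidate longest proper prefix
        for c in alphabet:
            w = p + c
            # w is absent, and its longest proper suffix w[1:] occurs in x.
            if w not in factors and w[1:] in factors:
                maws.add(w)
    return maws


def naive_maw_distance(x, y, alphabet):
    """Illustrative dissimilarity (an assumption): number of words that are
    minimal absent words of exactly one of the two sequences."""
    return len(minimal_absent_words(x, alphabet) ^ minimal_absent_words(y, alphabet))


print(sorted(minimal_absent_words("abaab", "ab")))  # ['aaa', 'aaba', 'bab', 'bb']
```

Checking only the longest proper prefix and suffix suffices because every other proper factor of a word is a factor of one of these two. A letter of the alphabet that never occurs in the sequence is reported as a minimal absent word of length one, since its only proper factor is the empty word.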
