Absent words and the (dis)similarity analysis of DNA sequences: an experimental study

Background: An absent word with respect to a sequence is a word that does not occur in the sequence as a factor; an absent word is minimal if all of its proper factors do occur in that sequence. In this paper we explore the idea of using minimal absent words (MAW) to compute the distance between two biological sequences. The motivation for our work is the potential advantage of extracting as little information as possible from large genomic sequences while still being able to compare them in an alignment-free manner.

Findings: We report an experimental study on the use of absent words as a distance measure among biological sequences, and we recommend the best indexes based on our analysis. In particular, our analysis reveals that the best performers are: the length-weighted index of relative absent word sets, the length-weighted index of the symmetric difference of the MAW sets, and the Jaccard distance between the MAW sets. We also found that the reverse complements of the sequences should be considered when computing the absent words.

Conclusion: The use of MAW to compute the distance between two biological sequences has a potential advantage over alignment-based methods. We expect that this potential advantage will encourage researchers and practitioners to use MAW as a (dis)similarity measure in the context of sequence comparison and phylogeny reconstruction. We therefore present a comparison among different possible models and indexes, to help biologists and researchers choose an appropriate model for such comparisons.
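To make the definitions in the abstract concrete, the following is a minimal brute-force sketch of how MAW sets and the Jaccard distance between them could be computed. It is not the authors' implementation: the function names, the quadratic enumeration of factors, and the way the reverse complement is folded in (a word counts as occurring if it occurs in either strand) are all assumptions made for illustration. Linear-time suffix-array methods exist for real genomes; this version is only suitable for short sequences.

```python
def factors(x):
    """Return the set of all factors (substrings) of x, including the empty word."""
    return {x[i:j] for i in range(len(x) + 1) for j in range(i, len(x) + 1)}


def reverse_complement(x):
    """Reverse complement of a DNA string over the alphabet A, C, G, T."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[c] for c in reversed(x))


def minimal_absent_words(x, alphabet="ACGT", with_revcomp=False):
    """Brute-force MAW set of x (hypothetical helper, not from the paper).

    A word a+u+b is a MAW when a+u and u+b both occur but a+u+b does not.
    A single letter that never occurs is also a MAW.
    """
    occ = factors(x)
    if with_revcomp:
        # Assumption: a word "occurs" if it occurs in x or in its reverse complement.
        occ |= factors(reverse_complement(x))
    maws = {a for a in alphabet if a not in occ}
    for u in occ:
        for a in alphabet:
            if a + u not in occ:
                continue
            for b in alphabet:
                if u + b in occ and a + u + b not in occ:
                    maws.add(a + u + b)
    return maws


def jaccard_distance(s, t):
    """Jaccard distance 1 - |s & t| / |s | t| between two sets."""
    union = s | t
    if not union:
        return 0.0
    return 1.0 - len(s & t) / len(union)


if __name__ == "__main__":
    # Over the binary alphabet {a, b}, the MAWs of "abaab" are aaa, aaba, bab, bb.
    print(sorted(minimal_absent_words("abaab", alphabet="ab")))
    # Distance between two short, made-up DNA strings, using both strands.
    m1 = minimal_absent_words("ACGTTGCA", with_revcomp=True)
    m2 = minimal_absent_words("ACGTAGCA", with_revcomp=True)
    print(jaccard_distance(m1, m2))
```

The Jaccard distance is used here because it is the simplest of the three indexes named in the findings; the two length-weighted indexes are not sketched because the abstract does not give their formulas.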
