An annotated k-deep prefix tree for (1-k)-mer based sequence comparisons

In this report, we describe an algorithm for a k-deep annotated prefix tree. The algorithm provides a computationally efficient, alignment-free method for comparing nucleotide sequences. Differences between genomic sequences are measured by recording and comparing the counts of words of length k or less in each sequence. Each tree node is annotated with a list that stores the number of times its word occurs in each sequence of a group. Count differences among multiple sequences can be computed in a single tree traversal. Such a tree is built in linear time, and its size is bounded by the tree depth rather than by the sequence length(s). We then compare groups of E. coli and Influenza A virus H1N1 sequences to demonstrate the utility of a k-deep prefix tree as a sequence comparison tool.
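The following Python sketch illustrates the general idea of a depth-limited prefix tree whose nodes carry one count per sequence. It is not the implementation described in this report. The names `Node`, `build_tree`, and `count_difference` are hypothetical, and the sum of absolute count differences is an assumed stand-in for the comparison measure, which is not specified here.

```python
from typing import Dict, List


class Node:
    """One tree node; the path from the root spells a word of length <= k."""

    def __init__(self, n_seqs: int) -> None:
        self.children: Dict[str, "Node"] = {}
        # Annotation: occurrences of this node's word in each sequence.
        self.counts: List[int] = [0] * n_seqs


def build_tree(seqs: List[str], k: int) -> Node:
    """Insert every word of length 1..k from every sequence into one tree."""
    n_seqs = len(seqs)
    root = Node(n_seqs)
    for idx, seq in enumerate(seqs):
        for start in range(len(seq)):
            node = root
            # Walk at most k symbols down from each start position, so the
            # tree never grows deeper than k regardless of sequence length.
            for ch in seq[start:start + k]:
                child = node.children.get(ch)
                if child is None:
                    child = Node(n_seqs)
                    node.children[ch] = child
                child.counts[idx] += 1
                node = child
    return root


def count_difference(root: Node, a: int, b: int) -> int:
    """Sum |count_a - count_b| over all words in one traversal (assumed measure)."""
    total = 0
    stack = [root]
    while stack:
        node = stack.pop()
        total += abs(node.counts[a] - node.counts[b])
        stack.extend(node.children.values())
    return total


tree = build_tree(["ACGT", "ACGA"], k=2)
print(count_difference(tree, 0, 1))
```

For a fixed k, each sequence position triggers at most k node updates, so construction time grows linearly with the total sequence length. Over the nucleotide alphabet the tree holds at most 4 + 4^2 + ... + 4^k word nodes, independent of how long the sequences are.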
