Paper information - An annotated k-deep prefix tree for (1-k)-mer based sequence comparisons

An annotated k-deep prefix tree for (1-k)-mer based sequence comparisons

In this report, we describe an algorithm for building a k-deep annotated prefix tree. The algorithm provides a computationally efficient, alignment-free method for comparing nucleotide sequences. Differences between genomic sequences are measured by recording and comparing the counts of words of length k or less in each sequence. Each tree node is annotated with a list that stores the number of times its word occurs in each sequence of a group, so count differences among multiple sequences can be computed in a single tree traversal. Such a tree is built in linear time, and its size is bounded by the tree depth rather than by the sequence length(s). We then compare groups of E. coli and Influenza A virus H1N1 sequences to demonstrate the utility of a k-deep prefix tree as a sequence comparison tool.
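The data structure described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `Node`, `build_tree`, `word_counts`, and `count_differences` are hypothetical, the construction is a naive one that walks up to k characters from every position (O(n·k) time, which is linear in sequence length n for a fixed k), and the sum of absolute count differences is only one assumed way to turn the per-node counts into a dissimilarity score.

```python
from typing import Dict, List


class Node:
    """One tree node; counts[i] is how often this node's word occurs in sequence i."""
    __slots__ = ("children", "counts")

    def __init__(self, n_seqs: int) -> None:
        self.children: Dict[str, "Node"] = {}
        self.counts: List[int] = [0] * n_seqs


def build_tree(seqs: List[str], k: int) -> Node:
    """Build a prefix tree of depth at most k holding every word of length 1..k."""
    root = Node(len(seqs))
    for idx, seq in enumerate(seqs):
        for start in range(len(seq)):
            node = root
            # Walk at most k characters from this start position, so the
            # tree depth never exceeds k regardless of sequence length.
            for ch in seq[start:start + k]:
                child = node.children.get(ch)
                if child is None:
                    child = Node(len(seqs))
                    node.children[ch] = child
                node = child
                node.counts[idx] += 1
    return root


def word_counts(root: Node, word: str) -> List[int]:
    """Return the per-sequence counts for `word` (all zeros if the word is absent)."""
    node = root
    for ch in word:
        nxt = node.children.get(ch)
        if nxt is None:
            return [0] * len(root.counts)
        node = nxt
    return list(node.counts)


def count_differences(root: Node, a: int, b: int) -> int:
    """Sum |count_a - count_b| over all words in a single tree traversal.

    This scoring rule is an assumption made for illustration only.
    """
    total = 0
    stack = list(root.children.values())
    while stack:
        node = stack.pop()
        total += abs(node.counts[a] - node.counts[b])
        stack.extend(node.children.values())
    return total


if __name__ == "__main__":
    tree = build_tree(["ACGT", "ACGA"], k=2)
    print(word_counts(tree, "A"))         # [1, 2]
    print(count_differences(tree, 0, 1))  # 4
```

For a four-letter nucleotide alphabet the tree holds at most 4 + 4^2 + ... + 4^k nodes, which is why its size depends on k rather than on how long the input sequences are.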
