Counter based suffix tree for DNA pattern repeats

Abstract In recent years, string datasets have grown exponentially, and so has the need to process them. Many of these datasets come from the field of bioinformatics, since the characteristics of any living organism are encoded in its genes. Genes consist of nucleic bases, and the genome as a whole is a concatenation of different types of these bases. Efficiently extracting the information encoded in these sequences requires algorithms that can decode it. Most available methods use the data structure commonly referred to as the suffix tree. Suffix trees have evolved considerably over the years, and the on-line construction of the suffix tree is deemed the best approach; however, it is not optimal for finding repeated sequences because of the many traversals an algorithm has to perform when identifying repeats. To improve the speed of finding repeats, we developed a counter-based suffix tree algorithm. Our work presents a novel algorithm for constructing a counter-based suffix tree without losing the properties of the suffix tree. The time complexity of the counter-based suffix tree is Θ(n), where n is the length of the string, which matches the fastest suffix tree implementation. We have shown that the counter-based suffix tree reduces the search time when identifying repeats, and we have proved that the counters can be built during construction of the tree.
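The idea of storing a counter on each node can be illustrated with a minimal sketch. This is not the algorithm presented in the paper: it uses a naive, uncompressed suffix trie that takes quadratic time and space, whereas the paper describes a Θ(n) construction. The names `Node`, `build_counted_suffix_trie`, `count_occurrences`, and `repeats` are hypothetical and chosen only for illustration. The sketch shows why counters help: once each node records how many suffixes pass through it, the number of occurrences of a pattern can be read off directly instead of traversing the subtree below it.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # Number of suffixes whose path passes through this node, which equals
    # the number of occurrences of the substring spelled from the root.
    count: int = 0
    children: dict = field(default_factory=dict)


def build_counted_suffix_trie(text: str) -> Node:
    """Insert every suffix of `text`, bumping a counter on each node visited.

    Naive O(n^2) illustration only, not a linear-time construction.
    """
    root = Node()
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.children.setdefault(ch, Node())
            node.count += 1
    return root


def count_occurrences(root: Node, pattern: str) -> int:
    """Return how often `pattern` occurs, read directly from the counter."""
    node = root
    for ch in pattern:
        node = node.children.get(ch)
        if node is None:
            return 0
    return node.count


def repeats(root: Node, min_len: int = 2, min_count: int = 2) -> dict:
    """Collect substrings of length >= min_len occurring >= min_count times."""
    found = {}
    stack = [(root, "")]
    while stack:
        node, prefix = stack.pop()
        for ch, child in node.children.items():
            if child.count < min_count:
                continue  # counts never grow with depth, so prune here
            label = prefix + ch
            if len(label) >= min_len:
                found[label] = child.count
            stack.append((child, label))
    return found


root = build_counted_suffix_trie("ACGACGT")
print(count_occurrences(root, "ACG"))
print(repeats(root))
```

Because a child's counter can never exceed its parent's, the search for repeats can skip any branch whose counter falls below the threshold, which is the kind of saving in traversal that the abstract refers to.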
