Construction of stochastic context trees for genetical texts

A method has been developed for constructing a tree source model of genetic text generation. Visualising the model as a suffix (context) tree provides a new way to analyse the contexts of symbol sequences. The stochastic complexity of the data, estimated within the framework of the model, serves as the criterion for model selection. The model and the complexity values are then used to analyse genetic texts. A software implementation of the algorithm makes it possible to reveal statistical properties of genetic sequences on the basis of an information measure. The program is available on the Internet at http://wwwmgs.bionet.nsc.ru/mgs/programs/complexity/.
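To make the idea of selecting a context tree by code length concrete, the following is a minimal sketch and not the authors' algorithm. It assumes a four-letter DNA alphabet, uses the Krichevsky-Trofimov estimator as a stand-in for the stochastic complexity of the symbols seen in each context, and ignores the cost of describing the tree itself. The names `kt_code_length`, `context_counts`, `best_tree` and the `max_depth` parameter are hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict
from math import lgamma, log

ALPHABET = "ACGT"  # assumption: plain DNA alphabet, no ambiguity codes


def kt_code_length(counts, m=len(ALPHABET)):
    """Code length in bits of the observed symbol counts under the
    Krichevsky-Trofimov estimator for an alphabet of size m."""
    n = sum(counts.values())
    log_p = lgamma(m / 2) - lgamma(n + m / 2)
    for c in counts.values():
        log_p += lgamma(c + 0.5) - lgamma(0.5)
    return -log_p / log(2)


def context_counts(seq, max_depth):
    """Count next-symbol occurrences for every context of length 0..max_depth.
    Positions before max_depth are skipped so that each node's counts equal
    the sum of its children's counts."""
    counts = defaultdict(Counter)
    for i in range(max_depth, len(seq)):
        for d in range(max_depth + 1):
            counts[seq[i - d:i]][seq[i]] += 1
    return counts


def best_tree(counts, max_depth, context=""):
    """Return (code_length, leaves) of the cheapest subtree rooted at context.
    A context is extended one symbol further into the past only when that
    strictly shortens the total code length."""
    leaf_cost = kt_code_length(counts[context])
    if len(context) == max_depth:
        return leaf_cost, [context]
    split_cost, split_leaves = 0.0, []
    for a in ALPHABET:
        child = a + context  # a is the symbol preceding the current context
        if child in counts:
            cost, leaves = best_tree(counts, max_depth, child)
            split_cost += cost
            split_leaves += leaves
    if split_leaves and split_cost < leaf_cost:
        return split_cost, split_leaves
    return leaf_cost, [context]


seq = "ACGT" * 50
counts = context_counts(seq, max_depth=2)
total_bits, leaves = best_tree(counts, max_depth=2)
```

For the periodic test sequence, one preceding symbol already determines the next one, so the selected tree stops at contexts of length one; a longer context would not shorten the code any further. A fuller treatment would also charge for encoding the tree structure.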
