Optimal segmentation using tree models

Sequence data are abundant in application areas such as computational biology, environmental sciences, and telecommunications. Many real-life sequences have a strong segmental structure, with segments of different complexities. In this paper we study the description of sequence segments using variable length Markov chains (VLMCs), also known as tree models. We discover the segment boundaries of a sequence and at the same time we compute a VLMC for each segment. We use the Bayesian information criterion (BIC) and a variant of the minimum description length (MDL) principle that uses the Krichevsky-Trofimov (KT) code length to select the number of segments of a sequence. On DNA data the method selects segments that closely correspond to the annotated regions of the genes.
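As a rough illustration of the ingredients named above, the following sketch computes the KT code length of a sequence and uses dynamic programming to find the split into a fixed number of segments that minimizes the summed code length. It is a simplification and not the paper's method: each segment is coded with a memoryless (order-0) model instead of a VLMC, the alphabet size of 4 is an assumption suited to DNA, and the function names `kt_code_length` and `segment` are hypothetical.

```python
import math
from collections import Counter


def kt_code_length(seq, alphabet_size=4):
    """KT code length (bits) of seq under a memoryless model (assumed here)."""
    counts = Counter()
    bits = 0.0
    for n, symbol in enumerate(seq):
        # KT sequential estimate: (count + 1/2) / (n + alphabet_size / 2)
        p = (counts[symbol] + 0.5) / (n + alphabet_size / 2)
        bits -= math.log2(p)
        counts[symbol] += 1
    return bits


def segment(seq, k, alphabet_size=4):
    """Split seq into k contiguous segments minimizing total KT code length.

    Returns (total_bits, boundaries), where boundaries are the start
    indices of segments 2..k.
    """
    n = len(seq)
    # cost[i][j] holds the KT code length of seq[i:j], built incrementally.
    cost = [[0.0] * (n + 1) for _ in range(n + 1)]
    for i in range(n):
        counts = Counter()
        bits = 0.0
        for j in range(i, n):
            p = (counts[seq[j]] + 0.5) / ((j - i) + alphabet_size / 2)
            bits -= math.log2(p)
            counts[seq[j]] += 1
            cost[i][j + 1] = bits

    inf = float("inf")
    # best[s][j]: minimal cost of coding seq[:j] with exactly s segments.
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    back = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for s in range(1, k + 1):
        for j in range(s, n + 1):
            for i in range(s - 1, j):
                candidate = best[s - 1][i] + cost[i][j]
                if candidate < best[s][j]:
                    best[s][j] = candidate
                    back[s][j] = i

    # Walk the back-pointers to recover the segment start positions.
    starts = []
    j = n
    for s in range(k, 0, -1):
        j = back[s][j]
        starts.append(j)
    starts.reverse()
    return best[k][n], starts[1:]


if __name__ == "__main__":
    total_bits, boundaries = segment("A" * 20 + "C" * 20, k=2)
    print(boundaries, round(total_bits, 2))
```

This sketch takes the number of segments `k` as given. Selecting `k` itself, as the abstract describes, would additionally require comparing the totals across values of `k` under BIC or an MDL criterion, which is not shown here.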
