An optimal DNA segmentation based on the MDL principle

The biological world is highly stochastic and inhomogeneous in its behavior. Transitions between homogeneous and inhomogeneous regions of DNA, also known as change points, carry important biological information. Our goal is to employ rigorous methods of information theory to quantify structural properties of DNA sequences. In particular, we adopt the Stein-Ziv lemma to find an asymptotically optimal discriminant function that determines whether two DNA segments are generated by the same source while ensuring an exponentially small probability of false positives. We then apply the minimum description length (MDL) principle to select the parameters of our segmentation algorithm. Finally, we perform extensive experimental work on human chromosome 9. After grouping A and G (purines) and T and C (pyrimidines), we discover change points between coding and noncoding regions as well as the beginning of a CpG island.
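The two ingredients named above, the purine/pyrimidine grouping and a test of whether two segments come from the same source, can be illustrated with a minimal Python sketch. It is not the paper's discriminant function: it assumes a memoryless source and substitutes the weighted Jensen-Shannon divergence between empirical symbol distributions, a statistic commonly used in entropic DNA segmentation. The function names and the `threshold` parameter are hypothetical.

```python
from collections import Counter
from math import log2

PURINES = set("AG")
PYRIMIDINES = set("CT")


def to_purine_pyrimidine(seq):
    """Map A/G to 'R' and C/T to 'Y'; any other symbol (e.g. N) is dropped."""
    out = []
    for base in seq.upper():
        if base in PURINES:
            out.append("R")
        elif base in PYRIMIDINES:
            out.append("Y")
    return "".join(out)


def entropy(counts, total):
    """Empirical Shannon entropy in bits of a symbol-count table."""
    return -sum((c / total) * log2(c / total) for c in counts.values() if c > 0)


def js_divergence(x, y):
    """Weighted Jensen-Shannon divergence between two segments.

    Assumption: each segment is modelled as a memoryless source, so only
    single-symbol frequencies are compared.
    """
    n, m = len(x), len(y)
    if n == 0 or m == 0:
        raise ValueError("both segments must be non-empty")
    cx, cy = Counter(x), Counter(y)
    pooled = entropy(cx + cy, n + m)
    return pooled - (n / (n + m)) * entropy(cx, n) - (m / (n + m)) * entropy(cy, m)


def same_source(x, y, threshold):
    """Declare 'same source' when the divergence does not exceed the threshold.

    The threshold is a hypothetical tuning parameter; the paper selects its
    parameters with the MDL principle, which this sketch does not implement.
    """
    return js_divergence(x, y) <= threshold


if __name__ == "__main__":
    left = to_purine_pyrimidine("AGGAGAAGGA")
    right = to_purine_pyrimidine("CTTCTCCTTC")
    print(js_divergence(left, right), same_source(left, right, threshold=0.1))
```

A larger divergence between adjacent windows suggests a candidate change point between them; how large it must be is exactly the parameter-selection question that the abstract assigns to MDL.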
