Non-parametric Decoding on Discrete Time Series and Its Applications in Bioinformatics

We address the question: how can one non-parametrically decode the unknown state-space vector underlying a long discrete time series? The time series of interest is governed by a single non-autonomous dynamic with only two internal states. This question reflects a dilemma found in many scientific areas: computational infeasibility versus inferential bias. The dilemma arises when deciding whether to impose structural assumptions on the state-space dynamics, assumptions that are likely to be very unrealistic in most real-world applications. To resolve it, we transform the decoding problem into an event-intensity change-point problem that requires no prior knowledge of the number of change-points. We propose a new decoding algorithm, called Hierarchical Factor Segmentation (HFS), to achieve computability and robustness. Through computer experiments, we compare the total decoding error of the HFS algorithm with that of the benchmark Viterbi algorithm. Under Hidden Markov Model (HMM) settings with true parameter values, the HFS algorithm is competitive with the Viterbi algorithm. Interestingly, when the Viterbi algorithm operates with maximum likelihood estimates (MLE) of the parameters, the HFS algorithm performs significantly better. Similarly favorable results are found when the Markov assumption is violated. We further demonstrate an important application of the HFS algorithm in bioinformatics, as a promising computational solution for finding CpG islands (DNA segments with aggregated CpG dinucleotides) on a genome sequence. An illustration on real data, a subsequence of human chromosome 22, is carried out and compared with one popular search algorithm.
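For readers unfamiliar with the benchmark, the following is a minimal sketch of Viterbi decoding for a two-state HMM with binary (event / no-event) observations, together with a total decoding error count. It is not the HFS algorithm, whose details are not given in this abstract. The function names, the Bernoulli emission model, and all parameter values are illustrative assumptions rather than settings from the paper.

```python
import math


def viterbi_two_state(obs, init, trans, emit):
    """Most likely hidden-state path for a two-state HMM with 0/1 observations.

    init[s]     : P(first state = s)
    trans[r][s] : P(next state = s | current state = r)
    emit[s]     : P(observation = 1 | state = s)   (assumed Bernoulli emissions)
    """
    def log_emit(s, x):
        p = emit[s] if x == 1 else 1.0 - emit[s]
        return math.log(p)

    # score[s] = best log-probability of any path that ends in state s
    score = [math.log(init[s]) + log_emit(s, obs[0]) for s in (0, 1)]
    back = []  # back[t][s] = best predecessor of state s at time t + 1

    for x in obs[1:]:
        new_score, pointers = [], []
        for s in (0, 1):
            cands = [score[r] + math.log(trans[r][s]) for r in (0, 1)]
            best = 0 if cands[0] >= cands[1] else 1
            pointers.append(best)
            new_score.append(cands[best] + log_emit(s, x))
        score = new_score
        back.append(pointers)

    # Trace the best path backwards from the best final state.
    state = 0 if score[0] >= score[1] else 1
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


def decoding_error(decoded, truth):
    """Total decoding error: number of positions where the states disagree."""
    return sum(d != t for d, t in zip(decoded, truth))


# Illustrative (hypothetical) parameters: sticky states, state 1 emits events often.
init = [0.5, 0.5]
trans = [[0.9, 0.1], [0.1, 0.9]]
emit = [0.1, 0.9]
obs = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0]
decoded = viterbi_two_state(obs, init, trans, emit)
```

Note that the decoder needs `init`, `trans`, and `emit` as inputs. When these are replaced by estimates, any estimation error propagates into the decoded path, which is the setting in which the abstract reports HFS performing better.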
