The power of amnesia: Learning probabilistic automata with variable memory length

We propose and analyze a distribution learning algorithm for variable memory length Markov processes. These processes can be described by a subclass of probabilistic finite automata which we name Probabilistic Suffix Automata (PSA). Although hardness results are known for learning distributions generated by general probabilistic automata, we prove that the algorithm presented here can efficiently learn distributions generated by PSAs. In particular, we show that for any target PSA, the KL-divergence between the distribution generated by the target and the distribution generated by the hypothesis output by the learning algorithm can be made small, with high confidence, in polynomial time and with polynomial sample complexity. The learning algorithm is motivated by applications in human-machine interaction. Here we present two applications of the algorithm: in the first, we use it to construct a model of the English language and then use that model to correct corrupted text; in the second, we construct a simple stochastic model of E. coli DNA.
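As a concrete illustration, below is a minimal Python sketch of the prediction-suffix-tree construction that underlies this style of algorithm: candidate contexts are grown one symbol at a time and retained only when their empirical next-symbol distribution differs noticeably from that of their shorter suffix. The function names, the thresholds (min_count, ratio, max_depth), and the simplified significance test are illustrative assumptions for this sketch, not the paper's exact Learn-PSA procedure, which uses carefully chosen thresholds and smoothing to obtain the KL-divergence guarantee.

```python
# Minimal sketch of learning a prediction suffix tree (PST), the data
# structure underlying variable memory length Markov models. Thresholds
# and the growing rule below are illustrative, not the paper's exact ones.

from collections import Counter

def next_symbol_dist(sample, context, alphabet, smoothing=1e-3):
    """Smoothed empirical distribution of the symbol following `context`."""
    counts = Counter()
    k = len(context)
    for i in range(k, len(sample)):
        if sample[i - k:i] == context:
            counts[sample[i]] += 1
    total = sum(counts.values())
    # Additive smoothing keeps every probability strictly positive,
    # which a finite KL-divergence to the target requires.
    denom = total + smoothing * len(alphabet)
    return {a: (counts[a] + smoothing) / denom for a in alphabet}

def learn_pst(sample, alphabet, max_depth=3, min_count=5, ratio=1.05):
    """Grow contexts by prepending symbols; keep a context only if its
    predictions differ from those of its shorter suffix by `ratio`."""
    dists = {"": next_symbol_dist(sample, "", alphabet)}
    kept = {""}
    frontier = [""]
    while frontier:
        context = frontier.pop()
        if len(context) >= max_depth:
            continue
        for a in alphabet:
            child = a + context  # suffix(child) == context
            occurrences = sum(1 for i in range(len(sample) - len(child) + 1)
                              if sample[i:i + len(child)] == child)
            if occurrences < min_count:
                continue  # too rare to estimate reliably
            dists[child] = next_symbol_dist(sample, child, alphabet)
            if any(dists[child][s] / dists[context][s] > ratio or
                   dists[context][s] / dists[child][s] > ratio
                   for s in alphabet):
                kept.add(child)
            frontier.append(child)  # explore longer contexts either way
    return {ctx: dists[ctx] for ctx in kept}

def predict(pst, history):
    """Predict using the longest suffix of `history` kept in the tree."""
    for k in range(len(history), -1, -1):
        if history[len(history) - k:] in pst:
            return pst[history[len(history) - k:]]

if __name__ == "__main__":
    text = "abracadabra" * 5
    pst = learn_pst(text, alphabet=sorted(set(text)))
    for ctx in sorted(pst, key=lambda c: (len(c), c)):
        print(repr(ctx), {s: round(p, 2) for s, p in pst[ctx].items()})
    print("next after 'br':", predict(pst, "br"))
```

Running the sketch on a short string prints the retained contexts with their smoothed next-symbol distributions; prediction for a new history falls back to the longest retained suffix, mirroring how a PSA's states correspond to suffixes of bounded memory length.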
