On-Line Cumulative Learning of Hierarchical Sparse n-grams

We present a system for on-line, cumulative learning of hierarchical collections of frequent patterns from unsegmented data streams. Such learning is critical for long-lived intelligent agents in complex worlds. Learned patterns enable prediction of unseen data and serve as building blocks for higher-level knowledge representation. We introduce a novel sparse n-gram model that, unlike pruned n-grams, learns on-line by stochastic search for frequent n-tuple patterns. Adding patterns as data arrives complicates probability calculations. We discuss an EM approach to this problem and introduce hierarchical sparse n-grams, a model that uses a better solution based on a new method for combining information across levels. A second new method for combining information from multiple granularities (n-gram widths) enables these models to search more effectively for frequent patterns (an on-line, stochastic analog of pruning in association rule mining). The result is an example of a rare combination: unsupervised, on-line, cumulative structure learning. Unlike prediction suffix tree (PST) mixtures, the model learns with no size bound while using less space than the data. It does not repeatedly iterate over the data (unlike MaxEnt feature construction). It discovers repeated structure on-line and (unlike PSTs) uses this to learn larger patterns. The type of repeated structure it captures is limited (e.g., compared to hierarchical HMMs) but still useful, and these are important first steps towards learning repeated structure in more expressive representations, which has seen little progress, especially in unsupervised, on-line contexts.
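
As a rough illustration of the style of model the abstract describes, the sketch below shows an on-line sparse n-gram that keeps counts only for an explicitly tracked set of n-tuples and promotes new candidates by stochastically sampling windows from the stream. This is a minimal sketch under assumed design choices, not the authors' algorithm; the class name, the sampling and promotion parameters, and the crude uniform back-off in prob() are all illustrative assumptions.

```python
# Minimal sketch (assumptions, not the paper's implementation) of an on-line
# sparse n-gram: counts are kept only for tracked n-tuples, and candidates
# are discovered by stochastic sampling of windows from the incoming stream.
import random
from collections import defaultdict, deque


class SparseNGram:
    def __init__(self, n, sample_rate=0.05, promote_count=3, seed=0):
        self.n = n                           # pattern width (n-gram order)
        self.counts = defaultdict(int)       # counts for tracked n-tuples only
        self.tracked = set()                 # patterns currently being counted
        self.candidates = defaultdict(int)   # provisional counts for sampled tuples
        self.sample_rate = sample_rate       # probability of sampling an untracked window
        self.promote_count = promote_count   # samples needed before tracking a candidate
        self.window = deque(maxlen=n)        # sliding window over the unsegmented stream
        self.total = 0                       # number of complete windows seen
        self.rng = random.Random(seed)

    def observe(self, symbol):
        """Consume one symbol from the stream, updating counts incrementally."""
        self.window.append(symbol)
        if len(self.window) < self.n:
            return
        tup = tuple(self.window)
        self.total += 1
        if tup in self.tracked:
            self.counts[tup] += 1
        elif self.rng.random() < self.sample_rate:
            # Stochastic search: occasionally sample an untracked window; if it
            # keeps reappearing in samples, promote it to the tracked set.
            self.candidates[tup] += 1
            if self.candidates[tup] >= self.promote_count:
                self.tracked.add(tup)
                del self.candidates[tup]

    def prob(self, tup, vocab_size):
        """Frequency estimate for a tuple. Untracked tuples share the leftover
        mass uniformly; this is a crude stand-in for the paper's method of
        combining information across levels of the hierarchy."""
        if tup in self.tracked:
            return self.counts[tup] / max(self.total, 1)
        tracked_mass = sum(self.counts.values())
        leftover = max(self.total - tracked_mass, 1)
        return leftover / max(self.total, 1) / (vocab_size ** self.n)


if __name__ == "__main__":
    model = SparseNGram(n=3)
    for ch in "abcabcabxabcab" * 50:
        model.observe(ch)
    print(sorted(model.tracked)[:5])
    print(model.prob(("a", "b", "c"), vocab_size=3))
```

A hierarchical version along the lines sketched in the abstract would maintain one such model per pattern width and combine their estimates across levels rather than backing off to a uniform distribution.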
