Voting experts: An unsupervised algorithm for segmenting sequences

We describe a statistical signature of chunks and an algorithm for finding them. While there is no formal definition of chunks, they may be reliably identified as configurations with low internal entropy (unpredictability) and high entropy at their boundaries. We show that the log frequency of a chunk is a measure of its internal entropy. The Voting-Experts algorithm exploits this signature to find word boundaries in text from four languages and episode boundaries in the activities of a mobile robot.
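
The entropy signature described above can be illustrated with a small sketch. It is not the Voting-Experts algorithm itself, only an estimate of the entropy of the symbol that follows each n-gram in an unsegmented string. The function name `boundary_entropy` and the toy input are assumptions made for illustration.

```python
from collections import Counter, defaultdict
from math import log2


def boundary_entropy(text, n):
    """Return {n-gram: entropy in bits of the symbol that follows it}.

    Hypothetical helper for illustration; not taken from the paper.
    """
    followers = defaultdict(Counter)
    # Count which symbol follows each occurrence of each n-gram.
    for i in range(len(text) - n):
        followers[text[i:i + n]][text[i + n]] += 1

    entropy = {}
    for gram, counts in followers.items():
        total = sum(counts.values())
        entropy[gram] = -sum(
            (c / total) * log2(c / total) for c in counts.values()
        )
    return entropy


# Toy unsegmented input: "the cat the dog the cow the pig" without spaces.
text = "thecatthedogthecowthepig"
inside = boundary_entropy(text, 2)["th"]   # "th" is always followed by "e"
at_edge = boundary_entropy(text, 3)["the"]  # "the" is followed by c, d, c, p
```

In this toy string the next symbol is fully predictable after `th`, which lies inside a word, and much less predictable after `the`, which ends one. That contrast is the kind of signal a boundary-finding method could use.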
