Learning Regular Expressions from Noisy Sequences

The presence of long gaps dramatically increases the difficulty of detecting and characterizing complex events hidden in long sequences. To cope with this problem, a learning algorithm based on an abstraction mechanism is proposed: it infers a general model of complex events from a set of learning sequences. Events are described by means of regular expressions, and the abstraction mechanism is based on the substitution property of regular languages. The induction algorithm proceeds bottom-up, progressively coarsening the sequence granularity so that correlations between subsequences separated by long gaps emerge naturally. Two abstraction operators are defined. The first detects regular expressions that contain no iterative constructs and abstracts them into non-terminal symbols. The second detects and abstracts iterated subsequences. By interleaving the two operators, regular expressions in general form may be inferred. Both operators are based on string alignment algorithms taken from bioinformatics. A restricted form of the algorithm has already been outlined in previous papers, where the emphasis was on applications. Here, an extended version of the algorithm is described and analyzed in detail.
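As a rough illustration of the second operator's idea (abstracting iterated subsequences into a single symbol), the following minimal sketch collapses tandem repeats in a symbol sequence. It is not the algorithm described in the paper: it uses exact matching rather than string alignment, so it tolerates no noise, and the function name `abstract_iterations`, its parameters `max_unit` and `min_repeats`, and the `(…)+` output notation are all assumptions made for this example.

```python
def abstract_iterations(seq, max_unit=3, min_repeats=2):
    """Collapse exact tandem repeats into one abstract symbol.

    Hypothetical helper for illustration only: a run such as
    a b a b a b is rewritten as the single symbol '(a b)+'.
    """
    out = []
    i = 0
    n = len(seq)
    while i < n:
        best_unit, best_count = 0, 0
        # Try every repeat-unit length and keep the one covering most symbols.
        for unit in range(1, max_unit + 1):
            if i + unit > n:
                break
            block = seq[i:i + unit]
            count = 1
            # Count how many consecutive copies of the block follow position i.
            while seq[i + count * unit:i + (count + 1) * unit] == block:
                count += 1
            if count >= min_repeats and count * unit > best_unit * best_count:
                best_unit, best_count = unit, count
        if best_count:
            # Abstract the whole run into a single coarser-grained symbol.
            out.append("(" + " ".join(seq[i:i + best_unit]) + ")+")
            i += best_unit * best_count
        else:
            out.append(seq[i])
            i += 1
    return out


coarse = abstract_iterations(list("xabababy"))
print(coarse)
```

Each pass shortens the sequence, which is the sense in which abstraction coarsens granularity: symbols that were far apart in the original sequence become neighbours in the abstracted one.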
