Data Complexity Issues in Grammatical Inference

Grammatical inference (also known as grammar induction) is a field that cuts across a number of research areas, including machine learning, formal language theory, syntactic and structural pattern recognition, computational linguistics, computational biology, and speech recognition. Several features specific to the problems it studies relate to data complexity. We argue that data complexity in grammatical inference can be studied at three levels. At the first (inner) level, the data can be strings, trees, or graphs; these are nontrivial objects whose topologies are not always easy to manage. The second level concerns the classes, and the representations of the classes, used for classification; formal language theory provides an elegant setting based on rewriting systems and recursivity, but one that is not easy to work with for classification or learning tasks. The combinatorial problems usually attached to these tasks do indeed prove difficult. The third level relates the objects to the classes. Membership may be problematic, and even more so when approximations (of the strings or of the languages) are used, for instance in a noisy setting. We argue that the main difficulties arise from the fact that the structural definitions of the languages and the topological measures do not match.
