Evolutionary Grammatical Inference

INTRODUCTION
Grammatical Inference (also known as grammar induction) is the problem of learning a grammar for a language from a set of examples. In a broad sense, the learner is presented with some data and should return a grammar that explains the input data to some extent. The inferred grammar can then be used to classify unseen data or to provide a suitable model for it. The classical formalization of Grammatical Inference (GI) is known as Language Identification in the Limit (Gold, 1967). Here, there is a finite set S⁺ of strings known to belong to the language L (the positive examples) and another finite set S⁻ of strings not belonging to L (the negative examples). The language L is said to be identifiable in the limit if there exists a procedure to find a grammar G such that S⁺ ⊆ L(G), S⁻ ∩ L(G) = ∅ (that is, G generates every positive example and no negative one) and, in the limit, for sufficiently large S⁺ and S⁻, L = L(G). The disjoint sets S⁺ and S⁻ are given to provide clues for the inference of the production rules P of the unknown grammar G used to generate the language L.
Applications of grammatical inference span such diverse fields as speech and natural language processing, gene analysis, pattern recognition, image processing, sequence prediction, information retrieval, cryptography, and many more. An excellent state-of-the-art overview of the subject is provided in (de la Higuera, 2005).
Traditionally, most work in GI has focused on the inference of regular grammars, that is, on inducing finite-state automata, which can be learned efficiently. For context-free languages, recent approaches have shown only limited success (Starkie, Coste & van Zaanen, 2004), because the search space of possible grammars is infinite. The parenthesis and palindrome languages are common test cases for the effectiveness of grammatical inference methods. Both languages are context-free. The parenthesis language is deterministic, but the palindrome language is nondeterministic (de la Higuera, 2005).
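The consistency requirement in Gold's formalization (every positive example is generated, no negative example is) can be illustrated with a minimal sketch for the parenthesis language. Everything in it is an assumption made for illustration and is not taken from the cited works: the candidate grammar, written in Chomsky normal form, the CYK membership test, the sample sets, and the names `accepts` and `is_consistent`.

```python
# Hypothetical candidate grammar for non-empty balanced parentheses,
# in Chomsky normal form:
#   S -> S S | L R | L A      A -> S R      L -> '('      R -> ')'
BINARY_RULES = {
    "S": [("S", "S"), ("L", "R"), ("L", "A")],
    "A": [("S", "R")],
}
TERMINAL_RULES = {"L": "(", "R": ")"}


def accepts(string, start="S"):
    """CYK membership test: does the candidate grammar generate `string`?"""
    n = len(string)
    if n == 0:
        return False  # this CNF grammar does not derive the empty string
    # table[i][k] holds the nonterminals deriving string[i : i + k + 1]
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, ch in enumerate(string):
        for nonterminal, terminal in TERMINAL_RULES.items():
            if terminal == ch:
                table[i][0].add(nonterminal)
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            for split in range(1, length):
                left = table[i][split - 1]
                right = table[i + split][length - split - 1]
                for head, bodies in BINARY_RULES.items():
                    for b, c in bodies:
                        if b in left and c in right:
                            table[i][length - 1].add(head)
    return start in table[0][n - 1]


def is_consistent(positives, negatives):
    """True if every positive and no negative example is generated."""
    return (all(accepts(s) for s in positives)
            and not any(accepts(s) for s in negatives))


# Illustrative sample sets (assumed, not from any benchmark)
S_PLUS = ["()", "(())", "()()", "(()())"]
S_MINUS = [")(", "(()", "())", "("]

if __name__ == "__main__":
    print(is_consistent(S_PLUS, S_MINUS))
```

A grammar that passes this check is only consistent with the finite samples; identification in the limit additionally requires that the hypothesis converge to L as the samples grow.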
The use of evolutionary methods for context-free grammatical inference is not new, but only a few attempts have been successful. Wyard (1991) used a Genetic Algorithm (GA) to infer grammars for the language of correctly balanced and nested parentheses with success, but failed on the language of sentences containing the same number of a's and b's (the aⁿbⁿ language). In another attempt (Wyard, 1994), he obtained positive results on the inference of two classes of context-free grammars: the …