Learning Balls of Strings from Edit Corrections

Learning languages in realistic settings raises several problems that admit no simple solution. On the one hand, languages are usually defined by complex grammatical mechanisms for which learning results are predominantly negative, as the few existing algorithms cope poorly with noise. On the other hand, the learning settings themselves rely either on information that is too simple (text) or on information that is unattainable (query systems that do not exist in practice and cannot be simulated). We consider a simple but sound class of languages defined via the widely used edit distance: the balls of strings. We propose to learn them with the help of a new kind of query, the correction query: when a string is submitted to the Oracle, she either accepts it if it belongs to the target language or proposes a correction, that is, a string of the language close to the query with respect to the edit distance. We show that even though the good balls are not learnable in Angluin's MAT model, they can be learned from a polynomial number of correction queries. Moreover, experimental evidence simulating a human Expert shows that this algorithm is resistant to approximate answers.
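To make the notion of a correction query concrete, the following is a minimal sketch, not the algorithm of the paper. It assumes a ball is given by a centre string and a radius under the Levenshtein edit distance, and it simulates the Oracle by brute-force breadth-first search over edit neighbours. The names `edit_distance`, `neighbours`, and `correction_query`, and the tie-breaking rule (shortest, then lexicographically smallest), are illustrative assumptions. The search is exponential and only suitable for tiny examples.

```python
def edit_distance(u, v):
    """Levenshtein distance (unit-cost insertion, deletion, substitution)."""
    prev = list(range(len(v) + 1))
    for i, a in enumerate(u, 1):
        cur = [i]
        for j, b in enumerate(v, 1):
            cur.append(min(prev[j] + 1,               # delete a
                           cur[j - 1] + 1,            # insert b
                           prev[j - 1] + (a != b)))   # substitute or match
        prev = cur
    return prev[-1]


def neighbours(w, alphabet):
    """All strings at edit distance exactly 1 from w over the given alphabet."""
    out = set()
    for i in range(len(w) + 1):
        for c in alphabet:
            out.add(w[:i] + c + w[i:])          # insertion
    for i in range(len(w)):
        out.add(w[:i] + w[i + 1:])              # deletion
        for c in alphabet:
            out.add(w[:i] + c + w[i + 1:])      # substitution
    out.discard(w)
    return out


def correction_query(w, centre, radius, alphabet):
    """Simulated Oracle (assumption: she returns a closest string of the ball).

    Returns None when w is accepted, otherwise a string of the ball at
    minimal edit distance from w. Ties are broken by (length, lexicographic
    order), which is an arbitrary choice made for this sketch.
    """
    if edit_distance(w, centre) <= radius:
        return None                              # w belongs to the ball
    seen, frontier = {w}, {w}
    while frontier:
        nxt = set()
        for s in frontier:
            nxt |= neighbours(s, alphabet)
        nxt -= seen                              # strings one step further from w
        hits = [s for s in nxt if edit_distance(s, centre) <= radius]
        if hits:
            return min(hits, key=lambda s: (len(s), s))
        seen |= nxt
        frontier = nxt


if __name__ == "__main__":
    print(correction_query("ab", "ab", 1, "ab"))     # accepted
    print(correction_query("aaaa", "ab", 1, "ab"))   # a closest string of the ball
```

In this sketch, a rejected query at distance k from the centre receives a correction at distance k - r from the query, where r is the radius, because walking k - r steps along an optimal edit script toward the centre already lands inside the ball.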
