Efficient Algorithm for Learning Simple Regular Expressions from Noisy Examples

We present an efficient algorithm for finding approximate repetitions in a given sequence of characters. First, we define a class of simple regular expressions, which have star-height one and contain no union operations, and a stochastic mutation process of a given length over a string of characters. Then, assuming that a given string of characters was obtained by applying this mutation process to some sufficiently long word generated by a simple regular expression, we try to restore the expression. We prove that, to within some reasonable accuracy, this is always possible if the length of the mutation process is bounded relative to the length of the example. We provide an algorithm that restores the expression in time linear in the length of the example and at most quadratic in the length of the expression. We discuss some extensions of the method and possible applications to bioinformatics.
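
To make the setting concrete, the following minimal Python sketch generates a word from a simple regular expression and then corrupts it. It illustrates the problem input only, not the restoration algorithm. The representation of an expression as a list of (block, starred) pairs, the function names, and the mutation model (a fixed number of random single-character insertions, deletions, and substitutions) are all assumptions made for illustration; the abstract does not specify them.

```python
import random

# Hypothetical representation: a simple regular expression (star-height one,
# no union) is a list of (block, starred) pairs.
# Example: ab(cd)*e  ->  [("ab", False), ("cd", True), ("e", False)]

def expand(expr, repeats):
    """Generate a word by repeating every starred block `repeats` times."""
    return "".join(block * repeats if starred else block
                   for block, starred in expr)

def mutate(word, steps, alphabet, rng):
    """Apply `steps` random single-character edits (assumed mutation model)."""
    chars = list(word)
    for _ in range(steps):
        op = rng.choice(("insert", "delete", "substitute"))
        if op == "insert" or not chars:
            # Insert a random character (also used when nothing is left to edit).
            chars.insert(rng.randrange(len(chars) + 1), rng.choice(alphabet))
        elif op == "delete":
            del chars[rng.randrange(len(chars))]
        else:
            chars[rng.randrange(len(chars))] = rng.choice(alphabet)
    return "".join(chars)

def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # match or substitute
        prev = cur
    return prev[-1]

expr = [("ab", False), ("cd", True), ("e", False)]
clean = expand(expr, 10)                        # a long word in ab(cd)*e
noisy = mutate(clean, 3, "abcde", random.Random(0))
```

Under this assumed model, a mutation process of length 3 leaves `noisy` within edit distance 3 of `clean`, which is the sense in which the noise is bounded relative to the length of the example.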
