An ILP Refinement Operator for Biological Grammar Learning

We are interested in using Inductive Logic Programming (ILP) to infer grammars representing sets of biological sequences; we call these biological grammars. ILP systems are well suited to this task in that biological grammars have already been represented as logic programs using the Definite Clause Grammar and String Variable Grammar formalisms. However, the speed at which ILP systems can generate biological grammars has been shown to be a bottleneck. This paper presents a novel refinement operator implementation, specialised to infer biological grammars with ILP techniques. This implementation is shown to significantly reduce inference times compared to the classical refinement operator: time gains larger than 5-fold were observed in four fifths of the experiments, and the maximum observed gain was over 300-fold.
