Mining Chemical Compound Structure Data Using Inductive Logic Programming

Discovering knowledge from chemical compound structure data is a challenge task in KDD. It aims to generate hypotheses describing activities or characteristics of chemical compounds from their own structures. Since each compound composes of several parts with complicated relations among them, traditional mining algorithms cannot handle this kind of data efficiently. In this research, we apply Inductive Logic Programming (ILP) for classifying chemical compounds. ILP provides comprehensibility to learning results and capability to handle more complex data consisting of their relations. Nevertheless, the bottleneck for learning first-order theory is enormous hypothesis search space which causes inefficient performance by the existing learning approaches compared to the propositional approaches. We introduces an improved ILP approach capable of handling more efficiently a kind of data called multiple-part data, i.e., one instance of data consists of several parts as well as relations among parts. The approach tries to find hypothesis describing class of each training example by using both individual and relational characteristics of its part which is similar to finding common substructures among the complex relational instances. Chemical compound data is multiple-part data. Each compound is composed of atoms as parts, and various kinds of bond as relations among atoms. We then apply the proposed algorithm for chemical compound structure by conducting experiments on two real-world datasets: mutagenicity in nitroaromatic compounds and dopamine antagonist compounds. The experiment results were compared to the previous approaches in order to show the performance of proposed approach.