Learning first-order rules from data with multiple parts: applications on mining chemical compound data

Inductive learning of first-order theories from examples suffers from a serious bottleneck: the hypothesis search space is enormous, so existing learning approaches perform poorly compared with propositional approaches. Moreover, to choose appropriate candidates, existing Inductive Logic Programming (ILP) systems use only quantitative information, e.g. the number of examples covered and the length of rules, which is insufficient when the search space contains many similar candidates. This paper introduces a novel approach that improves ILP by incorporating qualitative information into the search heuristics. It focuses on one kind of data, in which each instance consists of several parts together with relations among those parts. The approach aims to find a hypothesis describing each class using both the individual and the relational characteristics of the parts of examples. Data of this kind occur in various domains, notably in the representation of chemical compound structure: each compound is composed of atoms as parts and bonds as relations between pairs of atoms. We apply the proposed approach to discover rules that describe the activity of compounds from their structures on two real-world datasets: mutagenicity in nitroaromatic compounds and dopamine antagonist compounds. The results were compared with those of the existing method using ten-fold cross-validation, and the proposed method produced significantly more accurate predictions.
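
As an illustration of the multi-part data described above, the following minimal Python sketch represents a compound as a set of atoms (parts) and bonds (relations), and counts how many examples of each class a simple relational rule covers. This is the quantitative coverage information mentioned above, not the proposed qualitative heuristic. All names (`Atom`, `Bond`, `Compound`, `covers`, `coverage`), the class labels, and the two toy compounds are hypothetical and are not taken from the paper or its datasets.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Atom:
    atom_id: str   # identifier of the part within its compound
    element: str   # individual characteristic of the part


@dataclass(frozen=True)
class Bond:
    atom1: str     # atom_id of one endpoint
    atom2: str     # atom_id of the other endpoint
    bond_type: int # 1 = single, 2 = double (toy encoding, an assumption)


@dataclass
class Compound:
    name: str
    label: str          # class of the example, e.g. "active" / "inactive"
    atoms: List[Atom]
    bonds: List[Bond]


def covers(compound: Compound, elem_a: str, elem_b: str, bond_type: int) -> bool:
    """True if some bond of the given type joins an elem_a atom to an elem_b atom."""
    element_of = {a.atom_id: a.element for a in compound.atoms}
    for b in compound.bonds:
        if b.bond_type != bond_type:
            continue
        pair = (element_of[b.atom1], element_of[b.atom2])
        if pair == (elem_a, elem_b) or pair == (elem_b, elem_a):
            return True
    return False


def coverage(compounds: List[Compound], elem_a: str, elem_b: str,
             bond_type: int) -> Tuple[int, int]:
    """Return (covered active examples, covered inactive examples)."""
    pos = sum(1 for c in compounds
              if c.label == "active" and covers(c, elem_a, elem_b, bond_type))
    neg = sum(1 for c in compounds
              if c.label != "active" and covers(c, elem_a, elem_b, bond_type))
    return pos, neg


# Toy examples, invented purely for illustration.
compounds = [
    Compound("c1", "active",
             [Atom("a1", "c"), Atom("a2", "n"), Atom("a3", "o")],
             [Bond("a1", "a2", 1), Bond("a2", "a3", 2)]),
    Compound("c2", "inactive",
             [Atom("a1", "c"), Atom("a2", "c")],
             [Bond("a1", "a2", 1)]),
]

# Rule: "the compound contains a nitrogen atom double-bonded to an oxygen atom".
pos, neg = coverage(compounds, "n", "o", 2)
```

A search heuristic that relies only on such counts cannot separate two candidate rules with the same `(pos, neg)` pair, which is the situation the abstract describes as a search space with many similar candidates.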