Concept discovery on relational databases: New techniques for search space pruning and rule quality improvement

Multi-relational data mining has become popular because propositional problem definitions are limited in structured domains and because data is commonly stored in relational databases. Several relational knowledge discovery systems have been developed that employ various search strategies, heuristics, language pattern limitations and hypothesis evaluation criteria in order to cope with an intractably large search space and to generate high-quality patterns. In this work, we introduce an ILP-based concept discovery framework named Concept Rule Induction System (CRIS), which includes new approaches for search space pruning and new features for rule quality improvement, such as defining aggregate predicates and handling numeric attributes. In CRIS, all target instances are considered together, which leads to the construction of more descriptive rules for the concept. This property also makes it possible to use aggregate predicates more accurately in concept rule construction, and it facilitates the construction of transitive rules. A set of experiments is conducted to evaluate the performance of the proposed method in terms of accuracy and coverage.
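As a rough illustration of what an aggregate predicate and a coverage measure can look like, the following minimal sketch builds a count aggregate over a toy relation and evaluates one candidate rule against all target instances at once. It is not the CRIS implementation: the relation names, the data, the helper functions `count_aggregate` and `coverage`, and the definition of coverage as the fraction of target instances satisfying the rule body are all assumptions made for this example.

```python
from collections import Counter

# Hypothetical toy data (not from the paper):
# targets holds the instances of the target concept,
# transactions is a background relation (account_id, amount).
targets = ["a1", "a2", "a3"]
transactions = [
    ("a1", 50), ("a1", 70),
    ("a2", 20),
    ("a3", 10), ("a3", 30), ("a3", 90),
    ("a4", 5),
]

def count_aggregate(rows, key_index=0):
    """Return a mapping key -> number of rows related to that key.

    This plays the role of an aggregate predicate such as
    tx_count(Account, N) derived from transaction(Account, Amount).
    """
    return Counter(row[key_index] for row in rows)

def coverage(target_instances, body):
    """Fraction of target instances for which the rule body holds (assumed definition)."""
    covered = [t for t in target_instances if body(t)]
    return len(covered) / len(target_instances)

tx_count = count_aggregate(transactions)

# Candidate rule: target(X) :- tx_count(X, N), N >= 2.
def rule_body(x):
    return tx_count[x] >= 2

cov = coverage(targets, rule_body)
print(f"coverage = {cov:.2f}")
```

Because the rule is scored against the whole set of target instances rather than one seed example, the aggregate threshold (here `N >= 2`) can be judged by how many instances it actually separates, which is the intuition the abstract gives for considering all target instances together.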
