Discretization Numerical Data for Relational Data with One-to-Many Relations

Problem statement: Numerical data stored in a relational database are handled differently from numerical data stored in a single table, because an individual record can occur multiple times in the non-target table (a one-to-many association) and because relations between tables can be non-determinate. Numbers in Multi-Relational Data Mining (MRDM) are often discretized after considering the schema of the relational database. This study examined the effects of taking the one-to-many association into consideration when discretizing continuous numbers. Approach: Different alternatives for dealing with continuous attributes in MRDM were considered in this study, namely Equal-Width (EWD), Equal-Height (EH), Equal-Weight (EWG) and Entropy-Based (EB) discretization. The discretization procedures considered included algorithms that do not depend on the multi-relational structure of the data and algorithms that are sensitive to this structure. A new discretization method, called the Entropy-Instance-Based (EIB) discretization method, was implemented and evaluated with respect to C4.5 on two well-known multi-relational databases: the Mutagenesis dataset and the Hepatitis dataset from the PKDD 2005 Discovery Challenge. Results: In the Mutagenesis dataset, the entropy-instance-based discretization method produced better data summarization results than the other discretization methods when the number of bins, b, was large (b = 8). In contrast, in the Hepatitis dataset, the entropy-instance-based discretization method produced better data summarization results than the other discretization methods for all values of b. In the Hepatitis dataset, all discretization methods produced higher average accuracy (%) with the partitional clustering technique than with the hierarchical technique.
Conclusion: These results demonstrated that entropy-based discretization can be improved by taking the multiple-instance problem into consideration. It was also found that the partitional clustering technique produced better accuracy than the hierarchical clustering technique.
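
As an illustration of how the one-to-many association can change bin boundaries, the following is a minimal sketch of the three unsupervised schemes named in the abstract. It is not the authors' implementation. The function names, the toy data, and the choice to weight each row by the inverse of the number of rows owned by the same individual (used here for equal-weight binning) are assumptions made for this example. The entropy-based and entropy-instance-based methods are not sketched, because the abstract does not describe them in enough detail.

```python
from collections import Counter


def equal_width_cuts(values, b):
    """Split the range [min, max] into b intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / b
    return [lo + i * width for i in range(1, b)]


def weighted_quantile_cuts(values, weights, b):
    """Return b - 1 cut points so each bin holds about the same total weight.

    A cut point is the largest value that still belongs to the lower bin.
    """
    pairs = sorted(zip(values, weights))
    total = sum(w for _, w in pairs)
    cuts, running, k = [], 0.0, 1
    for value, weight in pairs:
        running += weight
        # Emit a cut each time the running weight reaches the next k/b share.
        while k < b and running >= k * total / b - 1e-12:
            cuts.append(value)
            k += 1
    return cuts


def equal_height_cuts(values, b):
    """Equal-height binning: every row counts once, whoever owns it."""
    return weighted_quantile_cuts(values, [1.0] * len(values), b)


def equal_weight_cuts(values, owners, b):
    """Equal-weight binning (assumed form): each individual contributes a
    total weight of 1, shared equally among its rows in the non-target table."""
    rows_per_owner = Counter(owners)
    weights = [1.0 / rows_per_owner[o] for o in owners]
    return weighted_quantile_cuts(values, weights, b)


# Hypothetical non-target table: individual "a" owns four rows,
# individuals "b" to "e" own one row each.
values = [1, 2, 3, 4, 10, 20, 30, 40]
owners = ["a", "a", "a", "a", "b", "c", "d", "e"]

print(equal_width_cuts(values, 2))           # one cut at the middle of the range
print(equal_height_cuts(values, 2))          # one cut at the median row
print(equal_weight_cuts(values, owners, 2))  # cut shifts toward single-row owners
```

In this toy table, equal-height binning places its cut inside the block of rows belonging to individual "a", whereas equal-weight binning moves the cut upward so that each bin covers a similar number of individuals rather than a similar number of rows.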
