A Comparative Study of Propositionalization and Multidimensional Data Mining on a Relational Database

Propositionalization and multidimensional data mining are the two main approaches applied to a relational database during the pre-processing stage of a knowledge discovery project for relational classification. Whether they differ in their effect on the final performance of the intelligent system has been much discussed, but few studies have used public data from real problems to help resolve the issue. This paper presents a preliminary performance comparison of the two approaches, applied to the database of a well-known benchmark from the international competition organized by PKDD 1999, for a binary classification problem in the credit risk domain. The comparison used stratified cross-validation, repeated 10 times to establish confidence intervals for the performance evaluation. Performance was measured by the maximum value of the Kolmogorov-Smirnov curve (KS2), with a Multilayer Perceptron neural network as the classifier. A one-tailed paired t-test showed that propositionalization yields better performance for the final classifier at a 95% confidence level.
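The evaluation protocol described above can be illustrated with a minimal sketch. It is not the authors' code: the function names `ks_statistic` and `paired_t` are hypothetical, and it assumes that the KS2 measure is the largest gap between the cumulative score distributions of the two classes and that the t-test is applied to paired per-run KS values.

```python
import math
from statistics import mean, stdev


def ks_statistic(scores, labels):
    """Largest gap between the cumulative score distributions of the two classes.

    `labels` holds 1 for one class and 0 for the other (an assumed encoding).
    """
    pairs = sorted(zip(scores, labels))
    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = len(pairs) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")

    cum_pos = cum_neg = 0
    best = 0.0
    i = 0
    while i < len(pairs):
        # Consume every example tied at this score before comparing the CDFs.
        score = pairs[i][0]
        while i < len(pairs) and pairs[i][0] == score:
            if pairs[i][1] == 1:
                cum_pos += 1
            else:
                cum_neg += 1
            i += 1
        best = max(best, abs(cum_pos / n_pos - cum_neg / n_neg))
    return best


def paired_t(a, b):
    """t statistic for the paired differences a[i] - b[i]; df = len(a) - 1."""
    diffs = [x - y for x, y in zip(a, b)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))
```

If ten paired KS values are compared (an assumption about how the runs are paired), the test has 9 degrees of freedom, and a one-tailed test at the 95% level rejects the null hypothesis when the t statistic exceeds roughly 1.833.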
