Benchmarking Data Mining Algorithms

Data mining is the process of sifting through the mass of organizational data, both internal and external, to identify patterns critical for decision support. Implementing a data mining effort successfully requires a careful assessment of the available tools and algorithms. The basic premise of this study is that machine-learning algorithms, which are assumption-free, should outperform their traditional statistical counterparts when mining business databases. The objective of this study is to test this proposition by investigating the performance of the algorithms across several scenarios. The scenarios are based on simulations designed to reflect the extent to which typical statistical assumptions are violated in the business domain. The results of the computational experiments support the proposition that machine-learning algorithms generally outperform their statistical counterparts under certain conditions. These results can be used as prescriptive guidelines for assessing the applicability of data mining techniques.
