Data Mining with Skewed Data

In this chapter, we explore the difficulties commonly encountered when applying machine learning techniques to real-world data, which are frequently skewed. A typical industrial example in which skewed data is an intrinsic problem is fraud detection in financial data. Where appropriate, we provide examples to facilitate the understanding of data mining with skewed data. The topics explored include, but are not limited to: data preparation, data cleansing, missing values, characteristics construction, variable selection, data skewness, objective functions, bottom line expected prediction, limited resource situations, parametric optimisation, model robustness and model stability.
