An empirical study of the effect of outliers on the misclassification error rate

An outlier is an observation that deviates so much from the other observations that it appears to have been generated by a different mechanism. Outlier detection has many applications, such as data cleaning, fraud detection, and network intrusion detection. Outliers can indicate individuals or groups whose behavior differs markedly from that of most individuals in the data set. Outliers are frequently removed to improve the accuracy of estimators, but an outlier sometimes carries meaning of its own, and that explanation is lost if the outlier is deleted. In this paper we study the effect of outliers on the performance of three well-known classifiers, based on the results observed on four real-world datasets. We detect outliers in four ways: with the Mahalanobis distance computed from robust statistical estimators of the center and the covariance matrix; with clustering, using the partitioning around medoids (PAM) algorithm; and with two data mining techniques, Bay’s algorithm for distance-based outliers and LOF, a density-based local outlier algorithm.
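The first detection approach mentioned above, a Mahalanobis distance computed from robust estimates of the center and covariance, can be illustrated with a minimal sketch. It is not the procedure used in the study: the robust fit here is a crude stand-in (mean and covariance of the points closest to the coordinate-wise median) rather than a proper robust estimator, it handles only two-dimensional data, and the function names, the `keep_fraction` parameter, and the sample points are all hypothetical. The cutoff is the 0.99 quantile of the chi-square distribution with 2 degrees of freedom, which has the closed form -2 ln(0.01).

```python
import math
from statistics import median

# 0.99 quantile of the chi-square distribution with 2 degrees of freedom.
CHI2_2DF_99 = -2.0 * math.log(0.01)  # about 9.21


def mean_and_cov(points):
    """Classical mean vector and 2x2 sample covariance of 2-D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), ((sxx, sxy), (sxy, syy))


def mahalanobis_sq(p, center, cov):
    """Squared Mahalanobis distance of p, using the closed-form 2x2 inverse."""
    (a, b), (_, d) = cov
    det = a * d - b * b
    dx, dy = p[0] - center[0], p[1] - center[1]
    return (d * dx * dx - 2.0 * b * dx * dy + a * dy * dy) / det


def flag_outliers(points, cutoff=CHI2_2DF_99, keep_fraction=0.75):
    """Return indices of points whose squared distance exceeds the cutoff.

    Assumption: the "robust" fit is a simple trimming step, not MCD or
    any estimator named in the paper.
    """
    med = (median(p[0] for p in points), median(p[1] for p in points))
    # Rank points by Euclidean distance to the coordinate-wise median.
    ranked = sorted(
        points, key=lambda p: (p[0] - med[0]) ** 2 + (p[1] - med[1]) ** 2
    )
    # Estimate center and covariance from the closest fraction only.
    core = ranked[: max(3, int(keep_fraction * len(points)))]
    center, cov = mean_and_cov(core)
    return [
        i for i, p in enumerate(points)
        if mahalanobis_sq(p, center, cov) > cutoff
    ]


# Hypothetical data: a 3x3 grid of ordinary points plus one distant point.
points = [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2),
          (2, 0), (2, 1), (2, 2), (10, 10)]
print(flag_outliers(points))  # prints [9]
```

Because the distant point is excluded from the trimmed fit, it cannot inflate the covariance and mask itself, which is the motivation for using robust estimates. A trimmed covariance without a consistency correction underestimates the true scatter, so this sketch tends to flag more points than a calibrated robust estimator would.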
