Outliers Identification Model in Point-of-Sales Data Using Enhanced Normal Distribution Method

Data mining extracts patterns and draws conclusions from data. Outlier detection, which identifies objects that fall several standard deviations away from the mean, is an important tool in commercial data mining. Characterizing the behavior of outliers can lead to new knowledge, such as the patterns of fraudulent transactions. However, outliers may also be meaningless aberrations. There is no rigid mathematical or statistical definition of what constitutes an outlier, and in many scenarios deciding whether an observation is an outlier is ultimately a subjective exercise. The standard deviation is central to outlier detection, yet it is sensitive to extreme values and can be inflated by a single borderline or extreme observation, or by a few of them. This can cause masking, where less extreme outliers go undetected because of the presence of the most extreme ones. This study proposes a novel outlier identification model using an enhanced normal distribution method. The model can explore different types of outliers, giving an end user the ability to fully or partially eliminate the outliers found in a retail point-of-sale (POS) dataset. In experiments, the enhanced normal distribution method appeared more accurate than the standard normal distribution method. The results were also evaluated subjectively by the client, who found most of the flagged observations to be true outliers, some of them representing potentially fraudulent transactions.
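The masking effect described in the abstract can be illustrated with a small sketch. This is not the enhanced normal distribution method proposed in the study, whose details are not given here; it only contrasts a single-pass z-score rule with a simple repeated variant that recomputes the mean and standard deviation after removing flagged values. The function names, the threshold `k = 3.0`, and the sale amounts are all hypothetical and chosen for illustration.

```python
from statistics import mean, pstdev


def zscore_outliers(values, k=3.0):
    """Return values lying more than k standard deviations from the mean."""
    m = mean(values)
    s = pstdev(values)  # population standard deviation
    if s == 0:
        return []
    return [v for v in values if abs(v - m) / s > k]


def iterative_zscore_outliers(values, k=3.0):
    """Repeat the z-score rule, dropping flagged values each round.

    Assumption: a plain illustrative workaround for masking,
    not the method proposed in the paper.
    """
    remaining = list(values)
    found = []
    while len(remaining) > 2:
        flagged = zscore_outliers(remaining, k)
        if not flagged:
            break
        found.extend(flagged)
        remaining = [v for v in remaining if v not in flagged]
    return found


# Hypothetical POS sale amounts: twenty ordinary sales, one moderate
# anomaly (60) and one extreme anomaly (1000).
sales = [9, 10, 11, 10, 9, 11, 10, 10, 9, 11] * 2 + [60, 1000]

print(zscore_outliers(sales))            # single pass
print(iterative_zscore_outliers(sales))  # repeated passes
```

With these made-up amounts, the single pass flags only the extreme value, because that value inflates the standard deviation enough to hide the moderate one. The repeated variant picks up the moderate anomaly once the extreme value has been removed.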