A Clustering Approach for Outliers Detection in a Big Point-of-Sales Database

Outlier detection, the task of finding rare events in a collection of patterns, is an emerging issue in machine learning concerned with detecting and eventually removing anomalous objects from data. A key challenge is that outlier (anomaly) detection is not a well-formulated problem. Outliers are defined as extreme values that deviate from the overall pattern in the data; they may indicate experimental errors, variability in measurement, or novelty. Detecting outliers in large databases can lead to the discovery of hidden knowledge, and identifying and removing them often helps to ensure that the observations represent the problem correctly. Although several techniques exist for detecting outliers in a given database, no single technique has proven to be a universal standard choice: different target applications call for different outlier detection methods. Clustering is a powerful machine learning method that defines outliers in terms of their distance to the cluster centers. In this study, we propose a clustering-based approach to identifying outliers in a retail point-of-sales dataset. To select the best clustering algorithm for this purpose, two algorithms are applied: K-means for hard (crisp) clustering and Fuzzy C-means (FCM) for soft clustering. The experimental results show that K-means outperforms FCM in terms of outlier detection efficiency and is an effective outlier detection solution.
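The idea of defining outliers by their distance to the cluster centers can be illustrated with a minimal sketch. It is not the implementation used in the study: the toy two-feature records, the explicitly supplied starting centers, and the flagging rule (distance greater than three times the median distance) are all assumptions made for illustration, and the helper names `assign`, `kmeans`, and `flag_outliers` are hypothetical.

```python
import math
import statistics


def assign(points, centers):
    """Return, for every point, the index of its nearest center."""
    return [
        min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
        for p in points
    ]


def kmeans(points, centers, iters=100):
    """Plain Lloyd-style K-means starting from caller-supplied centers."""
    for _ in range(iters):
        labels = assign(points, centers)
        new_centers = []
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                # New center is the coordinate-wise mean of the cluster.
                new_centers.append(
                    tuple(sum(c) / len(members) for c in zip(*members))
                )
            else:
                # An empty cluster keeps its previous center.
                new_centers.append(centers[j])
        if new_centers == centers:
            break
        centers = new_centers
    return centers, assign(points, centers)


def flag_outliers(points, centers, labels, factor=3.0):
    """Flag points far from their own center.

    The rule (distance > factor * median distance) is an assumed
    threshold for this sketch, not one taken from the paper.
    """
    dists = [math.dist(p, centers[lab]) for p, lab in zip(points, labels)]
    cutoff = factor * statistics.median(dists)
    return [p for p, d in zip(points, dists) if d > cutoff]


# Hypothetical toy records with two numeric features each.
points = [
    (1, 1), (1, 2), (2, 1), (2, 2),   # first dense group
    (8, 8), (8, 9), (9, 8), (9, 9),   # second dense group
    (9, 15),                          # isolated record
]

# Starting centers are fixed so the run is deterministic.
centers, labels = kmeans(points, centers=[(1, 1), (8, 8)])
outliers = flag_outliers(points, centers, labels)
print(outliers)  # [(9, 15)]
```

In this sketch the isolated record is assigned to the second cluster and pulls its center upward, yet it still lies much farther from that center than any other record lies from its own, so it is the only one flagged. On real data the features would normally be scaled first, and both the number of clusters and the threshold would need tuning.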
