A Novel Approach to Outlier Detection using Modified Grey Wolf Optimization and k-Nearest Neighbors Algorithm

Objectives: Detecting anomalies in datasets is an interesting yet challenging problem. This work proposes a hybrid model that uses meta-heuristics to detect dataset anomalies efficiently. Methods/Statistical Analysis: A distance-based modified grey wolf optimization algorithm is designed that uses the k-Nearest Neighbor algorithm to improve its results. The proposed approach operates on supervised datasets and reports anomalies for each attribute of the dataset, based on a threshold derived from a confidence interval. Findings: The proposed approach works on both regression and classification datasets in the supervised scenario. Results are reported as the number of anomalies detected and as accuracy computed from a confusion matrix; they were evaluated on a classification dataset in which the minority class is treated as anomalous relative to the majority class. The evaluation varies the threshold and ‘k’ values; the optimal parameters depend on the dataset domain and data distribution and can be identified accordingly. Application/Improvements: The proposed approach can be used for anomaly detection on supervised datasets from different domains. It can also be extended to the unsupervised scenario by integrating it with K-means clustering.
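The per-attribute, distance-based thresholding described in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: the modified grey wolf optimization step is omitted because the abstract does not specify it, and the scoring rule (mean distance to the k nearest neighbours), the threshold form (mean plus z standard deviations of the scores), and the names `knn_distance_scores`, `flag_anomalies`, and `z` are all assumptions made for illustration.

```python
import numpy as np


def knn_distance_scores(values, k):
    """Mean distance from each value to its k nearest neighbours (one attribute)."""
    values = np.asarray(values, dtype=float)
    # Pairwise absolute differences; for a single attribute this equals
    # the Euclidean distance.
    dist = np.abs(values[:, None] - values[None, :])
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbour
    nearest = np.sort(dist, axis=1)[:, :k]
    return nearest.mean(axis=1)


def flag_anomalies(X, k=3, z=1.96):
    """Flag entries, attribute by attribute, whose k-NN score exceeds a threshold.

    Assumed threshold: mean(scores) + z * std(scores), a simple stand-in for
    the confidence-interval threshold mentioned in the abstract.
    """
    X = np.asarray(X, dtype=float)
    flags = np.zeros(X.shape, dtype=bool)
    for j in range(X.shape[1]):
        scores = knn_distance_scores(X[:, j], k)
        threshold = scores.mean() + z * scores.std()
        flags[:, j] = scores > threshold
    return flags


# Usage: one attribute with evenly spaced values 0..19 and one injected value.
X = np.append(np.arange(20.0), 100.0).reshape(-1, 1)
flags = flag_anomalies(X, k=3, z=1.96)
print(np.flatnonzero(flags[:, 0]))  # prints [20], the row of the injected value
```

Varying `k` and `z` changes which points are flagged, which mirrors the abstract's remark that suitable parameter values depend on the dataset domain and data distribution.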
