Statistical Outlier Detection in Large Multivariate Datasets

This method detects outliers in large and very large datasets using a computationally efficient procedure. Tukey’s biweight function is applied to the dataset to obtain robust location and scale estimates that are insensitive to extreme values. Robust Mahalanobis distances for all data points are then calculated from these location and scale estimates. Next, Parzen-window density estimation is used to compute the probability density curve of the robust Mahalanobis distances. Outliers are identified as those points whose robust Mahalanobis distances have very low probability density.
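
The pipeline described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation. The function names, the median/MAD starting point, the biweight cutoff `c_factor`, the Silverman-style bandwidth, the Gaussian kernel, and the threshold `density_frac` are all hypothetical choices made for the example. Restricting flagged points to distances beyond the density peak is also an assumption, added because in several dimensions distances very close to zero are rare as well.

```python
import numpy as np


def mahalanobis(X, mu, cov):
    """Mahalanobis distance of each row of X from mu under scatter matrix cov."""
    diff = X - mu
    sol = np.linalg.solve(cov, diff.T).T
    return np.sqrt(np.maximum(np.einsum("ij,ij->i", diff, sol), 0.0))


def biweight_location_scatter(X, c_factor=3.0, n_iter=20):
    """Iteratively reweighted location/scatter using Tukey biweight weights.

    Assumption: the cutoff is c_factor times the median distance, so the
    scatter is only determined up to a constant factor (enough for ranking).
    """
    # Robust start: coordinate-wise median and MAD-based scales.
    mu = np.median(X, axis=0)
    mad = np.median(np.abs(X - mu), axis=0)
    cov = np.diag(np.maximum(1.4826 * mad, 1e-12) ** 2)
    for _ in range(n_iter):
        d = mahalanobis(X, mu, cov)
        u = d / (c_factor * max(np.median(d), 1e-12))
        # Biweight: smooth down-weighting, exactly zero beyond the cutoff.
        w = np.where(u < 1.0, (1.0 - u**2) ** 2, 0.0)
        mu = (w[:, None] * X).sum(axis=0) / w.sum()
        diff = X - mu
        cov = (w[:, None] * diff).T @ diff / w.sum()
    return mu, cov


def parzen_density(d, n_grid=512, chunk=10_000):
    """Gaussian Parzen-window density of the 1-D distances on a regular grid."""
    n = d.size
    q75, q25 = np.percentile(d, [75, 25])
    sigma = min(d.std(), (q75 - q25) / 1.349)
    h = 0.9 * max(sigma, 1e-12) * n ** (-0.2)  # Silverman-style bandwidth (assumed)
    grid = np.linspace(d.min() - 3 * h, d.max() + 3 * h, n_grid)
    dens = np.zeros(n_grid)
    for s in range(0, n, chunk):  # chunked so memory stays bounded for large n
        z = (grid[:, None] - d[None, s:s + chunk]) / h
        dens += np.exp(-0.5 * z**2).sum(axis=1)
    dens /= n * h * np.sqrt(2.0 * np.pi)
    return grid, dens


def detect_outliers(X, density_frac=0.05):
    """Return robust distances and a boolean outlier mask.

    Assumption: "very low density" means below density_frac times the peak
    density, and only distances larger than the peak location are flagged.
    """
    X = np.asarray(X, dtype=float)
    mu, cov = biweight_location_scatter(X)
    d = mahalanobis(X, mu, cov)
    grid, dens = parzen_density(d)
    dens_at = np.interp(d, grid, dens)
    mode = grid[np.argmax(dens)]
    flags = (dens_at < density_frac * dens.max()) & (d > mode)
    return d, flags
```

Calling `detect_outliers(X)` on an `(n, p)` array returns the robust distances and a mask of flagged rows. Evaluating the density on a fixed grid and interpolating keeps the cost roughly linear in the number of points, which is one plausible way to retain the efficiency the method aims for. The threshold `density_frac` trades false alarms against missed outliers and would need tuning for real data.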