Data outlier detection using the Chebyshev theorem

During data collection and analysis, it is often necessary to identify and possibly remove outliers. An objective method for identifying the outliers to be removed is critical. Many automated outlier detection methods are available; however, many are limited by distributional assumptions or require predefined upper and lower boundaries within which the data should lie. If the distribution of the data is known, that distribution can aid in finding outliers. Often, however, the distribution is not known, or the experimenter does not want to assume a particular distribution. In addition, there may not be enough information about a data set to determine reliable upper and lower boundaries. For these cases, an outlier detection method based on the empirical data and Chebyshev's inequality was developed. The method can detect multiple outliers at once rather than one at a time. It assumes that the data are independent measurements and that outliers make up a relatively small percentage of the data. Chebyshev's inequality gives an upper bound on the percentage of the data that can fall more than k standard deviations from the mean, and this bound requires no assumption about the distribution of the data. If the data are known to be unimodal but the distribution is otherwise unknown, the method can be improved by using the unimodal Chebyshev inequality. The Chebyshev outlier detection method uses the inequality to calculate upper and lower outlier detection limits; data values outside these limits are considered outliers. Outliers may be erroneous data, or they may be correct but highly unusual values. The algorithm does not ascertain the reason for an outlier; it identifies potential outliers so that domain experts can investigate the cause.
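To make the limit calculation concrete, the following is a minimal Python sketch, not the authors' implementation. It relies on the standard form of Chebyshev's inequality, P(|X - mean| >= k*sd) <= 1/k^2, so choosing an expected outlier proportion p gives k = 1/sqrt(p). For the unimodal case it assumes the Vysochanskij-Petunin bound, 4/(9*k^2), which gives k = 2/(3*sqrt(p)) and is valid only for p < 1/6. The function names, the parameter p, and the single-pass use of the sample mean and standard deviation are illustrative assumptions; the abstract does not specify these details.

```python
import math
import statistics


def chebyshev_limits(data, p=0.01, unimodal=False):
    """Return (lower, upper) outlier detection limits for `data`.

    p is the assumed maximum proportion of values expected outside the
    limits (a hypothetical tuning parameter, not taken from the source).
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    if unimodal and p >= 1.0 / 6.0:
        # The unimodal bound 4/(9 k^2) holds only for k > sqrt(8/3),
        # which corresponds to p < 1/6.
        raise ValueError("the unimodal bound requires p < 1/6")

    mean = statistics.fmean(data)
    sd = statistics.stdev(data)  # sample standard deviation

    if unimodal:
        k = 2.0 / (3.0 * math.sqrt(p))  # solves 4 / (9 k^2) = p
    else:
        k = 1.0 / math.sqrt(p)          # solves 1 / k^2 = p

    return mean - k * sd, mean + k * sd


def flag_outliers(data, p=0.01, unimodal=False):
    """Return the values in `data` that fall outside the detection limits."""
    lower, upper = chebyshev_limits(data, p=p, unimodal=unimodal)
    return [x for x in data if x < lower or x > upper]


if __name__ == "__main__":
    # Thirty ordinary readings plus one suspicious value (made-up data).
    readings = [9.0, 10.0, 11.0] * 10 + [100.0]
    print(flag_outliers(readings, p=0.05))  # [100.0]
    print(flag_outliers(readings, p=0.01))  # []
```

Because the sample mean and standard deviation are themselves inflated by extreme values, a single pass like this is conservative: with p = 0.01 the limits sit ten standard deviations from the mean, and the value 100.0 is not flagged in this small sample. The unimodal option yields narrower limits for the same p, which is the improvement the text refers to.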
