Outliers and the Simpson's Paradox

There is no standard definition of an outlier, but most authors agree that outliers are points that lie far from the other data points. Several outlier detection techniques have been developed, mainly with two different purposes. On one hand, outliers are the interesting observations, as in fraud detection; on the other hand, outliers are considered erroneous measurements that should be removed from the analysis, e.g. in robust statistics. In this work, we start from the observation that outliers are affected by the so-called Simpson's paradox: a trend that appears in different groups of data but disappears or reverses when these groups are combined. Given a dataset, we learn a regression tree. The tree grows by partitioning the data into groups that are increasingly homogeneous with respect to the target variable. At each partition defined by the tree, we apply a box plot to the target variable to detect outliers. We would expect deeper nodes of the tree to contain fewer and fewer outliers. Instead, we observe that some points previously signaled as outliers are no longer signaled as such, while new outliers appear: the identification of outliers depends on the context considered. Based on this observation, we propose a new method to quantify the level of outlierness of data points.
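As a rough illustration of this procedure, the sketch below fits regression trees of increasing depth with scikit-learn, applies the standard 1.5 × IQR box-plot rule inside every node of each tree, and scores each point by the fraction of depths at which it is flagged. The function names and the averaging rule used as the outlierness score are illustrative assumptions, not the exact definition used in this work.

```python
# Minimal sketch, assuming scikit-learn trees and the 1.5 * IQR box-plot rule.
# Names such as `outlierness_scores` are hypothetical, not the authors' code.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boxplot_outliers(y):
    """Flag values outside the 1.5 * IQR whiskers of a box plot."""
    q1, q3 = np.percentile(y, [25, 75])
    iqr = q3 - q1
    return (y < q1 - 1.5 * iqr) | (y > q3 + 1.5 * iqr)

def outlierness_scores(X, y, max_depth=4):
    """Fraction of tree depths at which each point is flagged as an outlier.

    At depth 0 the box-plot rule is applied to the whole target variable;
    at each deeper level it is applied separately inside every node
    (partition), so the same point may enter or leave the outlier set as
    the context becomes more homogeneous.
    """
    flags = np.zeros((max_depth + 1, len(y)), dtype=bool)
    for depth in range(max_depth + 1):
        if depth == 0:
            groups = np.zeros(len(y), dtype=int)  # single group: all the data
        else:
            tree = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X, y)
            groups = tree.apply(X)                # leaf id = partition at this depth
        for g in np.unique(groups):
            idx = np.where(groups == g)[0]
            flags[depth, idx] = boxplot_outliers(y[idx])
    return flags.mean(axis=0)                     # per-point outlierness in [0, 1]
```

Averaging the flags over depths is only one possible way to turn context-dependent outlier labels into a single score; it serves here to show how a point can be an outlier globally yet ordinary within its own partition, and vice versa.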