A trimmed mean approach to finding spatial outliers

Outlier detection aims to discover unusual data whose behavior is exceptional compared to the rest of the data. In contrast to non-spatial outliers, which are defined using only non-spatial attributes, spatial outliers are sites that differ markedly from their neighbors, where neighborhoods are defined in terms of spatial attributes, i.e., locations. In this paper, we propose a local trimmed mean approach to evaluating the spatial outlier factor, which is the degree to which a site is outlying compared to its neighbors. The structure of our approach strictly follows the general spatial data model, which states that spatial data consist of trend, dependence, and error. We empirically demonstrate that the trimmed mean is more outlier-resistant than the median in estimating sample location, and we employ it to estimate the spatial trend in our approach. In addition to using the 1st-order neighbors in computing the error, we also use higher-order neighbors to estimate the spatial trend. With the true outlier factor assumed to be given by the spatial error model, we compare our approach with the spatial statistic and the scatter plot. Experimental results on two real datasets show that our approach is significantly better than the scatter plot and slightly better than the spatial statistic.
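As a rough illustration of the idea of scoring a site against a trimmed mean of its neighbors, the following Python sketch may help. It is not the paper's algorithm: the abstract does not give the trimming fraction, the neighborhood definition, or how trend, dependence, and error are combined, so the names `trimmed_mean` and `outlier_scores`, the `trim=0.25` default, and the 1-D chain of sites with a radius-2 neighborhood are all assumptions made for illustration.

```python
from math import floor


def trimmed_mean(values, trim=0.25):
    """Mean after discarding the lowest and highest `trim` fraction of values.

    The trimming fraction is an illustrative assumption, not a value
    taken from the paper.
    """
    xs = sorted(values)
    if not xs:
        raise ValueError("trimmed_mean requires at least one value")
    # Number of values to drop from each end; always keep at least one value.
    k = min(floor(trim * len(xs)), (len(xs) - 1) // 2)
    kept = xs[k:len(xs) - k]
    return sum(kept) / len(kept)


def outlier_scores(values, neighbors, trim=0.25):
    """Absolute difference between each site and the trimmed mean of its neighbors.

    `neighbors[i]` lists the indices of the sites neighboring site i.
    This simple residual stands in for an outlier factor (hypothetical).
    """
    scores = []
    for i, nbrs in enumerate(neighbors):
        local_estimate = trimmed_mean([values[j] for j in nbrs], trim)
        scores.append(abs(values[i] - local_estimate))
    return scores


# Hypothetical data: sites on a line, with one anomalous value at index 5.
values = [10, 11, 10, 12, 11, 50, 10, 11, 12, 10, 11]
n = len(values)
# Assumed neighborhood: all other sites within two steps along the line.
neighbors = [[j for j in range(n) if j != i and abs(j - i) <= 2] for i in range(n)]

scores = outlier_scores(values, neighbors)
most_outlying = max(range(n), key=lambda i: scores[i])
print(most_outlying, scores[most_outlying])
```

In this toy example the anomalous value at index 5 is trimmed away when estimating the local level of its neighbors, so those neighbors receive small scores while the anomalous site itself receives the largest one. A plain neighborhood mean would instead be pulled toward 50 and inflate the scores of the surrounding sites.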
