Efficient estimation of dynamic density functions with an application to outlier detection

In this paper, we propose KDE-Track, a new method for estimating the dynamic density of data streams, built on the conventional and widely used Kernel Density Estimation (KDE) method. KDE-Track estimates the density efficiently, with linear complexity, by interpolating on a kernel model that is updated incrementally as streaming data arrive. Both theoretical analysis and experimental validation show that KDE-Track outperforms traditional KDE and a baseline method, Cluster-Kernels, in estimation accuracy on complex density structures in data streams, in computing time, and in memory usage. KDE-Track is also shown to capture the dynamic density of synthetic and real-world data in a timely manner. In addition, KDE-Track is used to accurately detect outliers in sensor data and is compared with two existing methods developed for detecting outliers and cleaning sensor data.
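To make the idea of interpolating on an incrementally updated kernel model concrete, here is a minimal one-dimensional sketch in Python. It is not the authors' implementation: the class `GridKDE`, the function `is_outlier`, the fixed uniform grid, the Gaussian kernel, the running-mean update over all points seen so far, and the density threshold are all assumptions chosen for illustration.

```python
import math
from bisect import bisect_right


class GridKDE:
    """Hypothetical sketch: KDE values kept on a fixed grid, queried by interpolation."""

    def __init__(self, lo, hi, num_points, bandwidth):
        self.h = bandwidth
        step = (hi - lo) / (num_points - 1)
        # Resampling points at which the kernel model is maintained (assumed uniform).
        self.grid = [lo + i * step for i in range(num_points)]
        self.values = [0.0] * num_points
        self.count = 0

    def _kernel(self, u):
        # Gaussian kernel scaled by the bandwidth (an assumed kernel choice).
        return math.exp(-0.5 * u * u) / (self.h * math.sqrt(2.0 * math.pi))

    def update(self, x):
        """Fold one new stream value into the density stored at every grid point."""
        self.count += 1
        for i, g in enumerate(self.grid):
            k = self._kernel((g - x) / self.h)
            # Running mean: after n updates, values[i] is the KDE at grid[i].
            self.values[i] += (k - self.values[i]) / self.count

    def density(self, x):
        """Estimate the density at x by linear interpolation between grid points."""
        if x <= self.grid[0]:
            return self.values[0]
        if x >= self.grid[-1]:
            return self.values[-1]
        j = bisect_right(self.grid, x)  # grid[j - 1] <= x < grid[j]
        x0, x1 = self.grid[j - 1], self.grid[j]
        w = (x - x0) / (x1 - x0)
        return (1.0 - w) * self.values[j - 1] + w * self.values[j]


def is_outlier(model, x, threshold):
    """Flag x as an outlier when its estimated density is below an assumed threshold."""
    return model.density(x) < threshold


if __name__ == "__main__":
    model = GridKDE(lo=0.0, hi=10.0, num_points=101, bandwidth=0.5)
    for value in [4.8, 5.0, 5.2, 5.1, 4.9]:
        model.update(value)
    print(model.density(5.0), is_outlier(model, 9.0, threshold=0.01))
```

In this sketch each update touches every grid point once, so its cost grows with the grid size rather than with the number of points seen, and a query needs only a binary search plus one interpolation. Tracking a density that changes over time would additionally require discarding or down-weighting old data, which this sketch omits.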
