Fast online computation of the Qn estimator with applications to the detection of outliers in data streams

Abstract We present fqn (Fast Qn), a novel algorithm for online computation of the Qn scale estimator. The algorithm works in the sliding window model, computing the Qn scale estimator over the current window. We thoroughly compare our algorithm for online Qn with the state-of-the-art competing algorithm by Nunkesser et al., and show that fqn (i) is faster, requiring only O(s) time in the worst case, where s is the length of the window; (ii) has a computational complexity that does not depend on the input distribution; and (iii) requires less space. To the best of our knowledge, our algorithm is the first that allows online computation of the Qn scale estimator in worst-case time linear in the size of the window. As an example of a possible application, besides its use as a robust measure of statistical dispersion, we show how to use the Qn estimator for fast detection of outliers in data streams. Extensive experimental results on both synthetic and real datasets confirm the validity of our approach.
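
For readers unfamiliar with the estimator, the sketch below computes Qn directly from its textbook definition (the k-th smallest pairwise absolute difference, with h = floor(n/2) + 1 and k = h(h-1)/2, scaled by the Gaussian consistency factor 2.2219) and applies it to a sliding window. It is a naive baseline that sorts all pairwise differences, not the fqn algorithm. The function names, the omission of any finite-sample correction, and the outlier rule (distance from the window median greater than a hypothetical threshold of 3 times Qn) are assumptions made for illustration only.

```python
from collections import deque
from itertools import combinations
from statistics import median

# Consistency factor for Gaussian data, as given by Rousseeuw and Croux.
QN_CONSTANT = 2.2219


def qn_naive(window):
    """Qn from its definition; quadratic in the window length (not fqn)."""
    n = len(window)
    if n < 2:
        raise ValueError("need at least two observations")
    h = n // 2 + 1
    k = h * (h - 1) // 2  # rank of the order statistic, 1-based
    diffs = sorted(abs(a - b) for a, b in combinations(window, 2))
    return QN_CONSTANT * diffs[k - 1]


def flag_outliers(stream, s, threshold=3.0):
    """Return indices of items far from the median of the previous s items.

    The rule |x - median| > threshold * Qn is an illustrative assumption.
    """
    window = deque(maxlen=s)  # oldest item is dropped automatically
    flagged = []
    for t, x in enumerate(stream):
        if len(window) == s:
            scale = qn_naive(window)
            center = median(window)
            if scale > 0 and abs(x - center) > threshold * scale:
                flagged.append(t)
        window.append(x)
    return flagged


if __name__ == "__main__":
    print(qn_naive([1, 2, 3, 4, 5]))
    print(flag_outliers([1, 2, 3, 4, 5, 100, 3, 2], s=5))
```

Because this baseline recomputes and sorts every pairwise difference for each window, it only serves to make the quantity being estimated concrete; the point of an online algorithm such as fqn is to avoid that cost.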
