Distributed Outlier Detection using Compressive Sensing

Computing outliers and related statistical aggregation functions over large-scale data sources is a critical operation in many cloud computing scenarios, e.g., service quality assurance, fraud detection, and novelty discovery. Such problems commonly have to be solved in a distributed environment where each node holds only a local slice of the data. To process a query on the global data, each node must transmit its local slice, or an aggregated subset of it, to a global aggregator node, which then computes the desired statistical aggregation function. In this setting, reducing the total communication cost is often critical to overall efficiency. In this paper, we show both theoretically and empirically that these communication costs can be significantly reduced for common distributed computing problems by exploiting the fact that production-level big data usually exhibits a form of sparse structure. Specifically, we devise a new aggregation paradigm for outlier detection and related queries. The paradigm combines compressive sensing, used for data sketching, with outlier detection techniques. We further propose an algorithm that works even for non-sparse data that concentrates around an unknown value. In both cases, we show that the communication cost is reduced to the logarithm of the global data size. We incorporate our approach into Hadoop and evaluate it on real web-scale production data (distributed click-data logs). Our approach reduces data-shuffling I/O by up to 99% and end-to-end job duration by up to 40% on many actual production queries.
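The sketch-and-aggregate idea in the abstract can be illustrated with a minimal Python sketch for the sparse case. It is an illustration under stated assumptions, not the paper's implementation: the Gaussian measurement matrix, the use of orthogonal matching pursuit for recovery, the toy data, and all names (`make_matrix`, `sketch`, `omp`, `N_KEYS`, `M_SKETCH`) are hypothetical choices made here. Because the sketch is linear, the aggregator can sum the per-node sketches and recover the few large global values from the sum.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def make_matrix(m, n, seed):
    """Gaussian measurement matrix; every node rebuilds it from the same seed."""
    rng = random.Random(seed)
    scale = m ** -0.5
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(n)] for _ in range(m)]

def sketch(phi, local):
    """Compress a local slice (dict key -> value) into len(phi) numbers: y = phi @ x."""
    return [sum(row[k] * v for k, v in local.items()) for row in phi]

def solve(a, b):
    """Solve a small square system by Gaussian elimination with partial pivoting."""
    k = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        for r in range(c + 1, k):
            f = aug[r][c] / aug[c][c]
            for j in range(c, k + 1):
                aug[r][j] -= f * aug[c][j]
    z = [0.0] * k
    for c in reversed(range(k)):
        tail = sum(aug[c][j] * z[j] for j in range(c + 1, k))
        z[c] = (aug[c][k] - tail) / aug[c][c]
    return z

def omp(phi, y, max_iter, tol=1e-12):
    """Orthogonal matching pursuit: recover a sparse x with y = phi @ x."""
    m, n = len(phi), len(phi[0])
    cols = [[phi[i][j] for i in range(m)] for j in range(n)]
    support, coef, residual = [], [], y[:]
    for _ in range(max_iter):
        if dot(residual, residual) < tol:
            break  # the sketch is fully explained by the current support
        # Pick the unused column most correlated with the residual.
        best = max((j for j in range(n) if j not in support),
                   key=lambda j: abs(dot(cols[j], residual)))
        support.append(best)
        # Least-squares fit on the support via the normal equations.
        gram = [[dot(cols[a], cols[b]) for b in support] for a in support]
        rhs = [dot(cols[a], y) for a in support]
        coef = solve(gram, rhs)
        residual = [y[i] - sum(c * cols[s][i] for c, s in zip(coef, support))
                    for i in range(m)]
    return {j: c for j, c in zip(support, coef) if abs(c) > 1e-6}

# Toy setup (assumed values): 400 keys, 80 measurements, three nodes.
N_KEYS, M_SKETCH, SEED = 400, 80, 7
phi = make_matrix(M_SKETCH, N_KEYS, SEED)
slices = [
    {17: 30.0, 123: -20.0, 5: 7.0},   # node 0
    {17: 40.0, 123: -25.0, 5: -7.0},  # node 1 (key 5 cancels globally)
    {17: 20.0},                       # node 2
]
sketches = [sketch(phi, s) for s in slices]       # computed on each node
total = [sum(vals) for vals in zip(*sketches)]    # summed at the aggregator
outliers = omp(phi, total, max_iter=6)
print(sorted(outliers.items()))
```

In this toy run each node ships `M_SKETCH` numbers regardless of how many keys its slice touches, and the aggregator never sees the raw slices. Key 5 is included to show why linearity matters: its local values cancel in the global sum, so it should not be reported even though two nodes hold non-zero values for it.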
