2015 Ieee International Conference on Big Data (big Data) Online Anomaly Detection over Big Data Streams

Data quality is a challenging problem in many real world application domains. While a lot of attention has been given to detect anomalies for data at rest, detecting anomalies for streaming applications still largely remains an open problem. For applications involving several data streams, the challenge of detecting anomalies has become harder over time, as data can dynamically evolve in subtle ways following changes in the underlying infrastructure. In this paper, we describe and empirically evaluate an online anomaly detection pipeline that satisfies two key conditions: generality and scalability. Our technique works on numerical data as well as on categorical data and makes no assumption on the underlying data distributions. We implement two metrics, relative entropy and Pearson correlation, to dynamically detect anomalies. The two metrics we use provide an efficient and effective detection of anomalies over high velocity streams of events. In the following, we describe the design and implementation of our approach in a Big Data scenario using state-of-the-art streaming components. Specifically, we build on Kafka queues and Spark Streaming for realizing our approach while satisfying the generality and scalability requirements given above. We show how a combination of the two metrics we put forward can be applied to detect several types of anomalies - like infrastructure failures, hardware misconfiguration or user-driven anomalies - in large-scale telecommunication networks. We also discuss the merits and limitations of the resulting architecture and empirically evaluate its scalability on a real deployment over live streams capturing events from millions of mobile devices.

[1]  Dong Xiang,et al.  Information-theoretic measures for anomaly detection , 2001, Proceedings 2001 IEEE Symposium on Security and Privacy. S&P 2001.

[2]  Qiang Ma,et al.  Frugal Streaming for Estimating Quantiles , 2013, Space-Efficient Data Structures, Streams, and Algorithms.

[3]  Charu C. Aggarwal,et al.  Outlier Detection for Temporal Data: A Survey , 2014, IEEE Transactions on Knowledge and Data Engineering.

[4]  Piotr Indyk,et al.  Maintaining Stream Statistics over Sliding Windows , 2002, SIAM J. Comput..

[5]  Scott Shenker,et al.  Discretized Streams: An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters , 2012, HotCloud.

[6]  Huaiyu Zhu On Information and Sufficiency , 1997 .

[7]  P. Flajolet,et al.  HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm , 2007 .

[8]  Graham Cormode,et al.  An improved data stream summary: the count-min sketch and its applications , 2004, J. Algorithms.

[9]  Michael J. Franklin,et al.  Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing , 2012, NSDI.

[10]  S. Venkatasubramanian,et al.  An Information-Theoretic Approach to Detecting Changes in Multi-Dimensional Data Streams , 2006 .

[11]  Jiawei Han,et al.  Mining Approximate Top-K Subspace Anomalies in Multi-Dimensional Time-Series Data , 2007, VLDB.

[12]  Scott Shenker,et al.  Discretized streams: fault-tolerant streaming computation at scale , 2013, SOSP.

[13]  Scott Shenker,et al.  Spark: Cluster Computing with Working Sets , 2010, HotCloud.

[14]  Odysseas Papapetrou,et al.  Sketch-based Querying of Distributed Sliding-Window Data Streams , 2012, Proc. VLDB Endow..

[15]  Nathan Marz,et al.  Big Data: Principles and best practices of scalable realtime data systems , 2015 .

[16]  S. Muthukrishnan,et al.  Data streams: algorithms and applications , 2005, SODA '03.

[17]  Graham Cormode,et al.  What's new: finding significant differences in network data streams , 2004, IEEE/ACM Transactions on Networking.

[18]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[19]  Nishant Garg Apache Kafka , 2013 .

[20]  Tok Wang Ling,et al.  HOS-Miner: A System for Detecting Outlying Subspaces of High-dimensional Data , 2004, VLDB.

[21]  Qingtao Wu,et al.  Network Anomaly Detection Using Time Series Analysis , 2005, Joint International Conference on Autonomic and Autonomous Systems and International Conference on Networking and Services - (icas-isns'05).

[22]  William Chad Young,et al.  Detecting and classifying anomalous behavior in spatiotemporal network data ∗ , 2014 .