Distributed infrastructures for network data correlation

Distributed data correlation is an essential component of many networked applications, including searching, monitoring, and anomaly detection. In large wired and wireless networks, such as the Internet and sensor networks, it is feasible to build distributed infrastructures that provide users with a data correlation facility. To date, however, few successful infrastructures have been designed, and fewer still implemented, for this purpose. In this thesis, we address the design and implementation of distributed infrastructures for data correlation. First, we discuss DIM, a distributed indexing system that supports multi-dimensional range queries in wireless sensor networks. We show that the insertion and query costs in DIM scale as O(√N) under reasonable assumptions about query distributions. We also describe a simple but efficient approach to balancing DIM under a realistic sensor network workload. Second, we discuss the design and implementation of MIND, an Internet-scale distributed indexing system designed specifically for network monitoring applications. We present the details of MIND, including tuple insertion, query processing, load balancing, and robustness under failures. We deployed a prototype of MIND on more than 100 PlanetLab nodes and validated it with traffic traces from two large backbone networks, Abilene and GEANT. Finally, we propose Defeat, a novel network-wide anomaly detection scheme that emphasizes high detection confidence and robustness against failures and attacks. We show how random aggregations of IP flows enable more precise identification of the underlying causes of anomalies, and how traffic sketches can be combined with a subspace method to detect anomalies with high accuracy and to identify the IP flows responsible for them. Defeat achieves detection rates comparable to those of previous methods while detecting many more anomalies than prior work, taking us a step closer to a robust online system for anomaly detection and identification.
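
As a rough illustration of why random aggregations of IP flows help with identification, the following Python sketch hashes flows into bins under several independent hash functions and reports the flows that fall into an anomalous bin under every one of them. It is not the Defeat implementation: a simple per-bin deviation test stands in for the subspace method, and all function names, the flow-key format, the bin count, and the threshold are hypothetical choices made for this example.

```python
import hashlib
from statistics import mean, pstdev


def bucket(flow_key, seed, width):
    """Map a flow key to one of `width` bins using a seeded hash."""
    digest = hashlib.sha256(f"{seed}:{flow_key}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % width


def build_sketch(flow_volumes, seed, width):
    """Aggregate per-flow volumes into `width` bins (one random aggregation)."""
    bins = [0.0] * width
    for key, volume in flow_volumes.items():
        bins[bucket(key, seed, width)] += volume
    return bins


def anomalous_bins(history, current, threshold=3.0):
    """Flag bins whose current value lies more than `threshold` standard
    deviations from that bin's historical mean.
    Assumption: this simple test replaces the subspace method."""
    flagged = set()
    for b in range(len(current)):
        past = [sketch[b] for sketch in history]
        mu, sigma = mean(past), pstdev(past)
        if abs(current[b] - mu) > threshold * max(sigma, 1e-9):
            flagged.add(b)
    return flagged


def identify_flows(history_intervals, current_interval, seeds, width):
    """Return the flows that land in a flagged bin under every hash function."""
    candidates = None
    for seed in seeds:
        history = [build_sketch(iv, seed, width) for iv in history_intervals]
        current = build_sketch(current_interval, seed, width)
        flagged = anomalous_bins(history, current)
        hits = {k for k in current_interval if bucket(k, seed, width) in flagged}
        candidates = hits if candidates is None else candidates & hits
    return candidates or set()


# Usage: 20 synthetic flows with mildly varying volume, then one flow spikes.
flows = [f"10.0.0.{i}->10.0.1.{i}" for i in range(20)]
history_intervals = [{k: 100 + (t % 3) for k in flows} for t in range(9)]
current_interval = {k: 100 for k in flows}
suspect = flows[7]
current_interval[suspect] = 10000
culprits = identify_flows(history_intervals, current_interval, seeds=range(6), width=64)
print(culprits)
```

With a single hash function, every flow sharing a bin with the spiking flow would be a suspect. Intersecting the suspects across several independent hash functions removes those coincidental collisions, which is the intuition behind using multiple random aggregations.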