Geometric synopses for multi-dimensional data streams
In recent years, a new class of applications has emerged that requires real-time processing of large volumes of streaming data. Such data streams arise in network monitoring, mining of financial stock market feeds, transactional databases, and sensor networks. Because the input is generally too large to fit into memory, there is a need for efficient techniques that build compact summaries capable of answering useful queries such as heavy hitters, quantiles, range queries, and clustering.
In this thesis, we present space-efficient schemes to summarize multidimensional data streams. Our first data structure, called Adaptive Spatial Partitioning (ASP), can answer multidimensional versions of various statistical queries such as frequent items (or iceberg queries), quantiles (or ranks), and range queries. The scheme extends to the sliding window model, a subclass of the turnstile model, and to weighted streams. ASP can also be constructed in a distributed setting, where the streams arrive at different locations. We then discuss an application of our data structure in building specialized hardware for program execution profiling.
We next focus on hierarchical heavy hitters (HHHs), which have been introduced as a natural generalization of heavy hitters for hierarchical data domains. We characterize the hardness of computing HHHs by providing space lower bounds for this problem. Specifically, we show that a single-pass deterministic scheme that computes p-HHHs in a d-dimensional hierarchy with any approximation guarantee must use $\Omega(1/p^{d+1})$ space.
Finally, we present our result on shape-based clustering in data streams. We consider the following problem: given a stream of two-dimensional points, how can we summarize its spatial distribution or shape using a small, bounded amount of memory? We propose a novel scheme, called ClusterHull, which represents the shape of the stream as a dynamic collection of convex hulls, with a total of at most m vertices, where m is the size of the memory.
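To make the convex-hull building block concrete, the following minimal sketch maintains a single convex hull over a stream of two-dimensional points, discarding every point that falls inside the current hull. It is not the ClusterHull algorithm: it keeps one hull rather than a dynamic collection, and it does not enforce the m-vertex memory budget. The names `convex_hull` and `StreamHull` are hypothetical and chosen only for this illustration; the hull routine is the standard monotone-chain method.

```python
def cross(o, a, b):
    """Signed area of the turn o -> a -> b (positive means counter-clockwise)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Monotone-chain convex hull; returns vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower = []
    for p in pts:
        # Drop points that would create a clockwise or collinear turn.
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other.
    return lower[:-1] + upper[:-1]


class StreamHull:
    """Illustrative single-hull summary (assumption: no vertex budget enforced)."""

    def __init__(self):
        self.hull = []

    def add(self, point):
        # Only the current hull vertices plus the new point are kept;
        # points that land inside the hull are discarded.
        self.hull = convex_hull(self.hull + [point])


summary = StreamHull()
for p in [(0, 0), (2, 0), (1, 1), (2, 2), (0, 2), (1, 0.5)]:
    summary.add(p)
print(summary.hull)
```

In this sketch the memory used is the number of hull vertices, which can grow without bound for adversarial inputs (for example, points on a circle); a scheme with a fixed budget of m vertices would additionally need a rule for merging or simplifying hulls.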