Processing continuous queries over streaming data with limited system resources

In a growing number of information processing applications (e.g., network monitoring and intrusion detection, financial time series analysis, clickstream analysis, sensor processing), data takes the form of continuous data streams rather than traditional stored databases. In many data stream applications, data arrival is fast but bursty, and data rates fluctuate over time. Systems that seek to give rapid or real-time query responses in such an environment must be prepared to deal gracefully with bursts in data arrival without compromising system performance. This thesis explores several data processing techniques, suitable for use with high-volume data streams, that take into account the limited computing resources available to the data processing system.

One approach to the analysis of data streams involves using continuous monitoring queries to track the status of the data stream over time. In contrast to traditional database queries, which give a snapshot of a system at a particular point in time, continuous queries are long-running queries whose answers are continually updated as new data arrives. The first part of this thesis describes two techniques for handling spikes in data arrival rates when processing continuous queries over data streams: (1) an operator scheduling algorithm that makes efficient use of available system memory, and (2) a load-shedding strategy that determines which unprocessed tuples should be dropped to reduce system load while minimizing degradation in the accuracy of query responses.

An alternative data analysis approach involves issuing ad hoc exploratory queries about the recent history of the data stream. This style of query can yield insights that are complementary to those obtained via continuous monitoring queries. While providing exact answers to ad hoc historical queries can be impractical due to resource limitations and the high data volumes of many data streams, accurate approximate answers will suffice in many cases. The second part of this thesis studies sampling-based approaches for approximately answering ad hoc queries over data streams, including algorithms for maintaining random samples over sliding windows, as well as non-uniform sampling techniques that give accurate approximate answers to group-by aggregation queries.
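
To make the idea of maintaining a random sample over a sliding window concrete, the following is a minimal Python sketch. It is an illustration under stated assumptions, not necessarily the algorithm developed in the thesis: it keeps a single uniform sample over the most recent `window_size` elements by giving each arrival an independent random priority and retaining only those elements that could still become the highest-priority element in the window. The class name `SlidingWindowSampler` and its methods are hypothetical names chosen for this example.

```python
import random
from collections import deque


class SlidingWindowSampler:
    """Hypothetical sketch: one uniform sample over the last `window_size` items.

    Each arrival gets a random priority. The sample is the item with the
    highest priority among those still inside the window. An item can be
    discarded as soon as a later item has a higher priority, because it can
    then never be the window maximum again.
    """

    def __init__(self, window_size, rng=None):
        if window_size <= 0:
            raise ValueError("window_size must be positive")
        self.window_size = window_size
        self.rng = rng or random.Random()
        self.count = 0  # number of items seen so far
        # (index, priority, value) tuples; priorities decrease front to back
        self.candidates = deque()

    def add(self, value):
        index = self.count
        self.count += 1
        priority = self.rng.random()
        # Drop older candidates dominated by the new arrival.
        while self.candidates and self.candidates[-1][1] <= priority:
            self.candidates.pop()
        self.candidates.append((index, priority, value))
        # Drop candidates that have slid out of the window.
        oldest_live = self.count - self.window_size
        while self.candidates[0][0] < oldest_live:
            self.candidates.popleft()

    def sample(self):
        if not self.candidates:
            raise LookupError("no items have been added yet")
        return self.candidates[0][2]


if __name__ == "__main__":
    sampler = SlidingWindowSampler(window_size=10)
    for reading in range(1000):
        sampler.add(reading)
    # The printed value is one of the last 10 readings (990..999).
    print(sampler.sample())
```

Because the priorities are independent and identically distributed, each element currently in the window is equally likely to hold the maximum priority, so the returned element is a uniform sample from the window. The candidate list never holds more than `window_size` entries and is typically much shorter, which is the kind of memory saving that matters when the window is too large to store in full.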