A flexible data mining architecture for monitoring data streams

Data streams are ubiquitous: performance measurements in business process management, faults and alarms in network traffic management, transactions in retail chains, ATM operations in banks, log records generated by web servers, and sensor network data are some specific examples. In almost all of these applications, the data volume is massive, up to several terabytes, and it grows further with the rapid arrival of new tuples. Traditional DBMSs are ill-equipped to process data streams in real time and do not provide adequate support for continuous queries posed over these streams. This dissertation outlines models and issues in designing an efficient Data Stream Management System (DSMS) called Stardust. The system can handle a diverse set of continuous queries that fit naturally into the mold of data stream applications. We developed wavelet-based approximation schemes that maintain multiple levels of information over streams of data in order to answer queries efficiently. In centralized DSMS models, a stream is summarized at a central site, and all user queries are processed at this site. In data- and query-intensive environments, the central site can become a bottleneck. As a remedy, we developed adaptive replication algorithms that disseminate stream summaries computed at a central site to interested clients. We tested the distributed version of the system on a number of testbeds. In the first scenario, Stardust exploits the scalability and load balancing of communication provided by content-based routing schemes for efficient distributed stream processing. In the second scenario, we integrated Stardust into a real-time decision support system for nondestructive health monitoring using a wireless network of sensors. The system trades off accuracy for efficient processing of sensor data in order to reduce communication overhead and power consumption.
Finally, we built an event detection framework for monitoring a set of distributed network elements. The goal is to detect potentially interesting incidents specified by users in terms of a multitude of race conditions across a set of routers while maintaining a low monitoring overhead.
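
The idea of maintaining multiple levels of information over a stream can be illustrated with a small sketch. The abstract says only that the schemes are wavelet-based; the choice of Haar-style pairwise averages, the power-of-two window, and the names `haar_average_levels` and `approx_range_mean` are assumptions made for illustration and are not taken from Stardust. The sketch builds progressively coarser summaries of one window and answers a range-mean query at a chosen level, so coarser levels read fewer values at the cost of accuracy.

```python
from typing import List


def haar_average_levels(window: List[float]) -> List[List[float]]:
    """Build a pyramid of pairwise averages (Haar approximation coefficients,
    up to scaling). Level 0 is the raw window; each higher level halves the
    number of values. Hypothetical helper, not the Stardust implementation."""
    n = len(window)
    if n == 0 or n & (n - 1):
        raise ValueError("window length must be a positive power of two")
    levels = [[float(v) for v in window]]
    current = levels[0]
    while len(current) > 1:
        # Average adjacent pairs to obtain the next, coarser level.
        current = [(current[i] + current[i + 1]) / 2.0
                   for i in range(0, len(current), 2)]
        levels.append(current)
    return levels


def approx_range_mean(levels: List[List[float]], level: int,
                      lo: int, hi: int) -> float:
    """Approximate the mean of raw positions [lo, hi) using only the values
    stored at `level`. Each cell at that level covers 2**level raw points and
    is weighted by how many of them fall inside the query range."""
    if not 0 <= level < len(levels):
        raise ValueError("level out of range")
    n = len(levels[0])
    if not 0 <= lo < hi <= n:
        raise ValueError("invalid query range")
    width = 1 << level
    total = 0.0
    for cell, avg in enumerate(levels[level]):
        start = cell * width
        overlap = min(hi, start + width) - max(lo, start)
        if overlap > 0:
            total += avg * overlap
    return total / (hi - lo)


if __name__ == "__main__":
    levels = haar_average_levels([1, 3, 5, 7, 2, 2, 8, 8])
    for level in range(len(levels)):
        print(level, approx_range_mean(levels, level, 0, 3))
```

For the window above, the query over positions [0, 3) returns the exact mean 3.0 at level 0 and 4.0 at level 2. The gap between the two is the accuracy given up in exchange for reading a single coarse value instead of three raw ones.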
