A Survey of Distributed Mining of Data Streams

With advances in data collection and generation technologies, organizations and researchers face the ever-growing problem of managing and analyzing large, dynamic datasets. Environments that produce streaming data are becoming commonplace. Examples include stock market, sensor, web clickstream, and network data. In many instances, these environments are also equipped with multiple distributed computing nodes, often located near the data sources. Analyzing and monitoring data in such environments requires data mining technology that is cognizant of the mining task, the distributed nature of the data, and the data influx rate. In this chapter, we survey the current state of the field and identify potential directions for future research.
