LOCI: Load Shedding through Class-Preserving Data Acquisition

An avalanche of data arriving in stream form is overstretching our ability to analyze it. In this paper, we propose a novel load shedding method that enables fast and accurate classification of stream data. We transform the input data so that its class information is concentrated in a few features, and we introduce a progressive classifier that makes predictions from partial input. For load shedding, we take advantage of the temporal locality of stream data (for example, readings from a temperature sensor usually do not change dramatically over a short period of time). We first show that our transform preserves the temporal locality of the original data; we then use positive and negative knowledge about the data, which is much smaller than the data itself, for classification. We use both analytical and empirical analysis to demonstrate the advantages of our approach.
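To make the idea of predicting from partial input concrete, the following is a minimal sketch and not the method proposed in the paper. It assumes a plain linear classifier whose features are bounded in absolute value, reads features in decreasing order of their maximum possible contribution, and stops as soon as the unread features can no longer change the sign of the score. The names `progressive_predict`, `read_feature`, and `bounds` are hypothetical and chosen only for illustration.

```python
from typing import Callable, Sequence, Tuple


def progressive_predict(
    weights: Sequence[float],
    bias: float,
    bounds: Sequence[float],
    read_feature: Callable[[int], float],
) -> Tuple[int, int]:
    """Classify with a linear model while reading as few features as possible.

    Assumption (illustrative only): feature i satisfies |x_i| <= bounds[i].
    Returns (label, number_of_features_read), with label in {+1, -1}.
    """
    # Maximum absolute contribution each feature can make to the score.
    max_contrib = [abs(w) * b for w, b in zip(weights, bounds)]
    # Read the potentially most influential features first.
    order = sorted(range(len(weights)), key=lambda i: max_contrib[i], reverse=True)

    score = bias
    remaining = sum(max_contrib)  # largest possible change from unread features
    used = 0
    for i in order:
        # If the unread features cannot flip the sign, stop acquiring data.
        if abs(score) > remaining:
            break
        score += weights[i] * read_feature(i)
        remaining -= max_contrib[i]
        used += 1
    return (1 if score >= 0 else -1), used


# Usage: the first feature dominates, so the remaining two are never read.
x = [1.0, -1.0, 1.0]
label, used = progressive_predict([5.0, 0.1, 0.1], 0.0, [1.0, 1.0, 1.0], lambda i: x[i])
```

In this sketch the early stop never changes the label that the full linear model would produce, provided the assumed bounds hold; the saving comes only from features that are not acquired. A transform that concentrates class information in a few features, as described above, would make such early stops more frequent.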
