Knowledge discovery from data streams

Since the beginning of the Internet age and the spread of ubiquitous computing devices, the large volume and continuous flow of distributed data have imposed new constraints on the design of learning algorithms. Exploring how to extract knowledge structures from evolving, time-changing data, Knowledge Discovery from Data Streams presents a coherent overview of state-of-the-art research in learning from data streams. The book covers the fundamentals needed to understand data streams and describes important applications, such as TCP/IP traffic, GPS data, sensor networks, and customer click streams. It also addresses several future challenges in data mining, when stream mining will be at the core of many applications. These challenges involve designing useful and efficient data mining solutions applicable to real-world problems. In the appendix, the author includes examples of publicly available software and online data sets. This practical, up-to-date book focuses on the new requirements of the next generation of data mining. Although the concepts presented in the text mainly concern data streams, they are also valid for other areas of machine learning and data mining.
