Adaptive Stream Mining: Pattern Learning and Mining from Evolving Data Streams

This book is a significant contribution to the subject of mining time-changing data streams and addresses the design of learning algorithms for this purpose. It makes new contributions on several aspects of the problem, identifies research opportunities and widens the scope for applications. It also includes an in-depth study of stream mining and a theoretical analysis of the proposed methods and algorithms. The first part is concerned with the use of an adaptive sliding window algorithm (ADWIN). Because ADWIN has rigorous performance guarantees, using it in place of counters or accumulators offers the possibility of extending such guarantees to learning and mining algorithms not initially designed for drifting data. Tests with several methods, including Naive Bayes, clustering, decision trees and ensemble methods, are also discussed. The second part of the book presents a formal study of connected acyclic graphs, or 'trees', from the point of view of closure-based mining, with efficient algorithms for subtree testing and for mining ordered and unordered frequent closed trees. Lastly, a general methodology for identifying closed patterns in a data stream is outlined. It is applied to develop an incremental method, a sliding-window based method, and a method that mines closed trees adaptively from data streams. These methods are then used to introduce classification methods for tree data streams.
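To make the adaptive-window idea concrete, the following is a minimal sketch, not the book's implementation. It assumes stream values lie in [0, 1], checks every split of the window into an older and a newer part, and drops the oldest items while the two sub-window means differ by more than a Hoeffding-style threshold. The class name `SimpleAdaptiveWindow`, its methods and the default `delta` are hypothetical, chosen for illustration. This naive version stores every item and rescans the window on each update, so it is suitable only for small examples.

```python
import math
from collections import deque


class SimpleAdaptiveWindow:
    """Naive adaptive sliding window for values in [0, 1] (illustrative only)."""

    def __init__(self, delta=0.002):
        self.delta = delta      # confidence parameter (assumed default)
        self.window = deque()   # oldest item on the left, newest on the right
        self.total = 0.0        # running sum of the items in the window

    def mean(self):
        """Current estimate: the average of the items kept in the window."""
        return self.total / len(self.window) if self.window else 0.0

    def _has_cut(self):
        """Return True if some old/new split has clearly different means."""
        n = len(self.window)
        if n < 2:
            return False
        delta_prime = self.delta / n
        head_sum = 0.0
        for n0, x in enumerate(self.window, start=1):
            if n0 == n:
                break
            head_sum += x
            n1 = n - n0
            m = 1.0 / (1.0 / n0 + 1.0 / n1)
            # Hoeffding-style threshold for the difference of two means.
            eps_cut = math.sqrt(math.log(4.0 / delta_prime) / (2.0 * m))
            if abs(head_sum / n0 - (self.total - head_sum) / n1) > eps_cut:
                return True
        return False

    def add(self, x):
        """Insert x; shrink the window if change is detected. Returns True on change."""
        self.window.append(x)
        self.total += x
        changed = False
        while self._has_cut():
            self.total -= self.window.popleft()  # forget the oldest item
            changed = True
        return changed


if __name__ == "__main__":
    w = SimpleAdaptiveWindow()
    stream = [0.0] * 200 + [1.0] * 200  # abrupt change halfway through
    for t, value in enumerate(stream):
        if w.add(value):
            print(f"change detected at item {t}, window size now {len(w.window)}")
    print("final mean estimate:", w.mean())
```

A learner that keeps its statistics in such a window instead of a plain counter uses only recent, consistent data after a change, which is the substitution the abstract describes.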
