Adaptive learning and mining for data streams and frequent patterns

This thesis is devoted to the design of data mining algorithms for evolving data streams and for the extraction of closed frequent trees. First, we deal with each of these tasks separately, and then we deal with them together, developing classification methods for data streams containing items that are trees. In the data stream model, data arrive at high speed, and the algorithms that must process them have very strict space and time constraints. In the first part of this thesis we propose and illustrate a framework for developing algorithms that can adaptively learn from data streams that change over time. Our methods are based on using change detectors and estimator modules at the right places. We propose an adaptive sliding window algorithm, ADWIN, for detecting change and keeping updated statistics from a data stream, and use it as a black box in place of counters or accumulators in algorithms initially not designed for drifting data. Since ADWIN has rigorous performance guarantees, this opens the possibility of extending such guarantees to learning and mining algorithms. We test our methodology with several learning methods, such as Naïve Bayes, clustering, decision trees and ensemble methods. We build an experimental framework for data stream mining with concept drift, based on the MOA framework, similar to WEKA, so that it will be easy for researchers to run experimental data stream benchmarks. Trees are connected acyclic graphs and they are studied as link-based structures in many cases. In the second part of this thesis, we describe a rather formal study of trees from the point of view of closure-based mining. Moreover, we present efficient algorithms for subtree testing and for mining ordered and unordered frequent closed trees.
We include an analysis of the extraction of association rules of full confidence out of the closed sets of trees, and we have found there an interesting phenomenon: rules whose propositional counterpart is nontrivial are, however, always implicitly true in trees due to the peculiar combinatorics of the structures. Finally, using these results on evolving data stream mining and closed frequent tree mining, we present high-performance algorithms for mining closed unlabeled rooted trees adaptively from data streams that change over time. We introduce a general methodology to identify closed patterns in a data stream, using Galois Lattice Theory. Using this methodology, we then develop an incremental method, a sliding-window-based method, and finally a method that mines closed trees adaptively from data streams. We use these methods to develop classification methods for tree data streams.
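
To make the adaptive sliding window idea more concrete, the following is a minimal, unoptimized sketch in the spirit of ADWIN, not the thesis implementation. It assumes inputs lie in [0, 1], stores the window explicitly instead of using a compressed bucket structure, and uses a Hoeffding-style cut threshold with a confidence parameter delta. The class name NaiveAdaptiveWindow, its method names, and the default delta are hypothetical and chosen only for illustration.

```python
import math
from collections import deque


class NaiveAdaptiveWindow:
    """Illustrative adaptive window: drops old items when two sub-windows
    have means that differ by more than a Hoeffding-style threshold."""

    def __init__(self, delta=0.01):
        self.delta = delta      # confidence parameter (assumed default)
        self.window = deque()   # stored values, oldest first
        self.total = 0.0        # running sum of the stored values

    def _cut_threshold(self, n0, n1):
        # Harmonic-mean style effective size of the two sub-windows.
        m = 1.0 / (1.0 / n0 + 1.0 / n1)
        # Split the confidence over the window length (n0 + n1).
        delta_prime = self.delta / (n0 + n1)
        return math.sqrt(math.log(4.0 / delta_prime) / (2.0 * m))

    def _should_cut(self):
        # Check every split of the window into an older and a newer part.
        values = list(self.window)
        n = len(values)
        head_sum = 0.0
        for i in range(n - 1):
            head_sum += values[i]
            n0, n1 = i + 1, n - (i + 1)
            mean0 = head_sum / n0
            mean1 = (self.total - head_sum) / n1
            if abs(mean0 - mean1) > self._cut_threshold(n0, n1):
                return True
        return False

    def add(self, x):
        """Add a value in [0, 1]; return True if the window was shrunk."""
        self.window.append(x)
        self.total += x
        changed = False
        # Drop the oldest item while some split still signals a change.
        while len(self.window) > 1 and self._should_cut():
            self.total -= self.window.popleft()
            changed = True
        return changed

    def mean(self):
        """Current estimate: the average of the values kept in the window."""
        return self.total / len(self.window) if self.window else 0.0


if __name__ == "__main__":
    w = NaiveAdaptiveWindow(delta=0.01)
    stream = [0.0] * 200 + [1.0] * 200   # abrupt change halfway through
    detections = [t for t, x in enumerate(stream) if w.add(x)]
    print("first detection at index:", detections[0] if detections else None)
    print("window length:", len(w.window), "estimated mean:", w.mean())
```

An object like this can stand in for a plain counter or running average inside a learner: the learner reads mean() instead of a global average, and the window shrinks by itself when the stream changes. This sketch costs time linear in the window length per item, so it is meant only to show the cut test, not to be efficient.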
