Incremental Maintenance on the Border of the Space of Emerging Patterns

Emerging patterns (EPs) are useful knowledge patterns with many applications. In recent studies on bio-medical profiling data, we have successfully used such patterns to solve difficult cancer diagnosis problems and produced higher classification accuracy when compared to alternative methods. However, the discovery of EPs is a challenging and computationally expensive problem.In this paper, we study how to incrementally modify and maintain the concise boundary descriptions of the space of all emerging patterns when small changes occur to the data. As EP spaces are convex, the maintenance on the bounds guarantees that no desired patterns are lost. We introduce algorithms to handle four types of changes: insertion of new data, deletion of old data, addition of new attributes, and deletion of old attributes. We compare these incremental algorithms, on six benchmark data sets, against an efficient algorithm that computes from scratch. The results show that the incremental algorithms are much faster than the From-Scratch method, often with tremendous speed-up rates.

[1]  Graham J. Williams,et al.  On-line unsupervised outlier detection using finite mixtures with discounting learning algorithms , 2000, KDD '00.

[2]  Jaideep Srivastava,et al.  Event detection from time series data , 1999, KDD '99.

[3]  Jian Pei,et al.  Mining frequent patterns without candidate generation , 2000, SIGMOD 2000.

[4]  Kotagiri Ramamohanarao,et al.  Instance-Based Classification by Emerging Patterns , 2000, PKDD.

[5]  Tom Fawcett,et al.  Activity monitoring: noticing interesting changes in behavior , 1999, KDD '99.

[6]  Haym Hirsh,et al.  Learning DNF Via Probabilistic Evidence Combination , 1993, ICML.

[7]  Yiming Yang,et al.  Topic Detection and Tracking Pilot Study Final Report , 1998 .

[8]  Jinyan Li,et al.  Efficient mining of emerging patterns: discovering trends and differences , 1999, KDD '99.

[9]  Yizhak Idan,et al.  Discovery of fraud rules for telecommunications—challenges and solutions , 1999, KDD '99.

[10]  Jinyan Li,et al.  CAEP: Classification by Aggregating Emerging Patterns , 1999, Discovery Science.

[11]  Vic Barnett,et al.  Outliers in Statistical Data , 1980 .

[12]  Tom Fawcett,et al.  Combining Data Mining and Machine Learning for Effective Fraud Detection , 1997 .

[13]  J. Ross Quinlan,et al.  Improved Use of Continuous Attributes in C4.5 , 1996, J. Artif. Intell. Res..

[14]  Douglas M. Hawkins Identification of Outliers , 1980, Monographs on Applied Probability and Statistics.

[15]  Ramakrishnan Srikant,et al.  Fast Algorithms for Mining Association Rules in Large Databases , 1994, VLDB.

[16]  Heikki Mannila,et al.  Levelwise Search and Borders of Theories in Knowledge Discovery , 1997, Data Mining and Knowledge Discovery.

[17]  D. Rubin,et al.  Maximum likelihood from incomplete data via the EM - algorithm plus discussions on the paper , 1977 .

[18]  Geoffrey J. McLachlan,et al.  On the choice of the number of blocks with the incremental EM algorithm for the fitting of normal mixtures , 2003, Stat. Comput..

[19]  Nils J. Nilsson,et al.  MLC++, A Machine Learning Library in C++. , 1995 .

[20]  David M. Rocke Robustness properties of S-estimators of multivariate location and shape in high dimension , 1996 .

[21]  Huiqing Liu,et al.  Simple rules underlying gene expression profiles of more than six subtypes of acute lymphoblastic leukemia (ALL) patients , 2003, Bioinform..

[22]  Salvatore J. Stolfo,et al.  Mining Audit Data to Build Intrusion Detection Models , 1998, KDD.

[23]  Jiawei Han,et al.  Attribute-Oriented Induction in Relational Databases , 1991, Knowledge Discovery in Databases.

[24]  Keki B. Irani,et al.  Multi-interval discretization of continuos attributes as pre-processing for classi cation learning , 1993, IJCAI 1993.

[25]  Jinyan Li,et al.  Geography of Differences between Two Classes of Data , 2002, PKDD.

[26]  J. Downing,et al.  Classification, subtype discovery, and prediction of outcome in pediatric acute lymphoblastic leukemia by gene expression profiling. , 2002, Cancer cell.

[27]  Devika Subramanian,et al.  The Common Order-Theoretic Structure of Version Spaces and ATMSs , 1991, Artif. Intell..

[28]  Kotagiri Ramamohanarao,et al.  DeEPs: A New Instance-Based Lazy Discovery and Classification System , 2004, Machine Learning.

[29]  Ramakrishnan Srikant,et al.  Fast algorithms for mining association rules , 1998, VLDB 1998.

[30]  Tom M. Mitchell,et al.  Generalization as Search , 2002 .

[31]  Jian Pei,et al.  Mining frequent patterns without candidate generation , 2000, SIGMOD '00.

[32]  Kotagiri Ramamohanarao,et al.  Making Use of the Most Expressive Jumping Emerging Patterns for Classification , 2000, Knowledge and Information Systems.

[33]  Geoffrey J. McLachlan,et al.  Finite Mixture Models , 2019, Annual Review of Statistics and Its Application.

[34]  Jinyan Li,et al.  Identifying good diagnostic gene groups from gene expression profiles using the concept of emerging patterns , 2002, Bioinform..

[35]  Roberto J. Bayardo,et al.  Efficiently mining long patterns from databases , 1998, SIGMOD '98.

[36]  Graham J. Williams,et al.  Mining the Knowledge Mine: The Hot Spots Methodology for Mining Large Real World Databases , 1997, Australian Joint Conference on Artificial Intelligence.

[37]  Michèle Sebag,et al.  Delaying the Choice of Bias: A Disjunctive Version Space Approach , 1996, ICML.

[38]  Raymond T. Ng,et al.  Algorithms for Mining Distance-Based Outliers in Large Datasets , 1998, VLDB.

[39]  Tom M. Mitchell,et al.  Version Spaces: A Candidate Elimination Approach to Rule Learning , 1977, IJCAI.

[40]  Robert C. Holte,et al.  Very Simple Classification Rules Perform Well on Most Commonly Used Datasets , 1993, Machine Learning.

[41]  I. Grabec Self-organization of neurons described by the maximum-entropy principle , 1990, Biological Cybernetics.

[42]  U. Alon,et al.  Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. , 1999, Proceedings of the National Academy of Sciences of the United States of America.

[43]  Raymond T. Ng,et al.  Finding Intensional Knowledge of Distance-Based Outliers , 1999, VLDB.

[44]  Salvatore J. Stolfo,et al.  Toward Scalable Learning with Non-Uniform Class and Cost Distributions: A Case Study in Credit Card Fraud Detection , 1998, KDD.

[45]  M. Wand,et al.  EXACT MEAN INTEGRATED SQUARED ERROR , 1992 .

[46]  Kotagiri Ramamohanarao,et al.  The Space of Jumping Emerging Patterns and Its Incremental Maintenance Algorithms , 2000, ICML.

[47]  Dino Pedreschi,et al.  A classification-based methodology for planning audit strategies in fraud detection , 1999, KDD '99.

[48]  John Shawe-Taylor,et al.  Detecting Cellular Fraud Using Adaptive Prototypes. , 1997, AAAI 1997.

[49]  Peter J. Braspenning,et al.  Version Space Learning with Instance-Based Boundary Sets , 1998, ECAI.

[50]  Haym Hirsh,et al.  Generalizing Version Spaces , 1994, Machine Learning.

[51]  Jiawei Han,et al.  Exploration of the power of attribute-oriented induction in data mining , 1995, KDD 1995.

[52]  Tomasz Imielinski,et al.  Mining association rules between sets of items in large databases , 1993, SIGMOD Conference.

[53]  Salvatore J. Stolfo,et al.  Mining in a data-flow environment: experience in network intrusion detection , 1999, KDD '99.