Principles of Data Mining and Knowledge Discovery

In this paper, we consider the problem of discovering interesting substructures in a large collection of semi-structured data within the framework of optimized pattern discovery. We model both the data and the patterns as labeled ordered trees, and present an efficient algorithm that discovers the best labeled ordered trees, that is, those that optimize a given statistical measure such as information entropy or classification accuracy over the collection. We give theoretical analyses of the computational complexity of the algorithm for patterns of bounded and unbounded size. Experiments show that the algorithm performs well and discovers interesting patterns on real datasets.
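
To make the optimization target concrete, the following is a minimal sketch of one of the measures named above, information entropy, used to score a single candidate pattern. It is an illustration under stated assumptions, not the paper's algorithm: it assumes a two-class labeling of the data trees and assumes that the counts of trees matched and not matched by the pattern are already known. The function names `binary_entropy` and `split_entropy` are hypothetical.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a two-class distribution with positive fraction p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def split_entropy(matched_pos, matched_neg, unmatched_pos, unmatched_neg):
    """Weighted entropy of the split a pattern induces on a labeled collection.

    Assumption: each data tree is labeled positive or negative, and the
    pattern partitions the collection into matched and unmatched trees.
    Lower values mean the pattern separates the two classes better.
    """
    n_matched = matched_pos + matched_neg
    n_unmatched = unmatched_pos + unmatched_neg
    total = n_matched + n_unmatched
    if total == 0:
        return 0.0
    score = 0.0
    for pos, size in ((matched_pos, n_matched), (unmatched_pos, n_unmatched)):
        if size > 0:
            # Weight each side of the split by its share of the collection.
            score += (size / total) * binary_entropy(pos / size)
    return score
```

Under these assumptions, a pattern that matches all positive trees and no negative ones scores 0, while a pattern whose matched and unmatched sets are both evenly mixed scores 1 bit. An optimized pattern search would look for the tree pattern that minimizes this score.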
