Towards Accurate and Efficient Classification: A Discriminative and Frequent Pattern-Based Approach

Classification is a core method widely studied in machine learning, statistics, and data mining. Many classification methods have been proposed in the literature, such as Support Vector Machines, Decision Trees, and Bayesian Networks, most of which assume that the input data is represented as feature vectors. In some classification problems, however, the predefined feature space is not discriminative enough to distinguish between the classes. More seriously, in many other applications the input data has a very complex structure but no initial feature vector representation, such as transaction data (e.g., customer shopping transactions), sequences (e.g., protein sequences and software execution traces), graphs (e.g., chemical compounds and molecules, social and biological networks), semi-structured data (e.g., XML documents), and text. In both scenarios, a primary question is how to construct a discriminative and compact feature set on which a classifier can achieve good performance. Although many kernel-based approaches have been proposed to transform the feature space and measure the similarity between two data objects, the implicit definition of the feature space makes kernel-based approaches hard to interpret, and their high computational complexity makes them difficult to scale to large problems. A concrete example of complex structural data classification is classifying chemical compounds into classes (e.g., toxic vs. nontoxic, active vs. inactive), where a key challenge is constructing discriminative graph features: simple features such as atoms and links are too primitive to preserve the structural information, while graph kernel methods yield classifiers that are hard to interpret.

In this dissertation, I proposed to use frequent patterns as higher-order, discriminative features to characterize data, especially complex structural data, and thus enhance classification power. Toward this goal, I designed a framework of discriminative frequent pattern-based classification, which has been shown to improve classification performance significantly. Theoretical analysis is provided to reveal the association between a feature's frequency and its discriminative power, demonstrating that frequent patterns are good candidates for discriminative features.

Due to the explosive nature of frequent pattern mining, frequent pattern-based feature construction can become a computational bottleneck if the complete set of frequent patterns with respect to a minimum support threshold is generated. To overcome this bottleneck, I proposed two solutions, DDPMine and LEAP, which directly mine the most discriminative patterns without generating the complete set. Both methods have been shown to improve efficiency while maintaining classification accuracy. I further applied discriminative frequent pattern-based classification to chemical compounds with a very skewed class distribution, which poses challenges for both feature construction and model learning. An ensemble framework that combines ensembles in both the data space and the feature space is proposed to handle these challenges and is shown to achieve good classification performance. In conclusion, the framework of discriminative frequent pattern-based classification can lead to highly accurate, efficient, and interpretable classifiers on complex data. The pattern-based classification technique could have great impact in a wide range of applications, including text categorization, chemical compound classification, and software behavior analysis.
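To make the core idea concrete, the following minimal Python sketch illustrates frequent pattern-based feature construction on toy transaction data: frequent itemsets are enumerated, scored by information gain, and the top-scoring patterns become binary features for any standard classifier. The toy dataset, thresholds, and brute-force enumeration are assumptions made purely for exposition; this sketch does not reproduce the dissertation's DDPMine or LEAP algorithms, which avoid enumerating the complete pattern set.

```python
from itertools import combinations
from math import log2

# Toy transaction dataset: each record is (set of items, class label).
# Hypothetical data for illustration only.
data = [
    ({"a", "b", "c"}, 1), ({"a", "b"}, 1), ({"a", "c", "d"}, 1),
    ({"b", "d"}, 0), ({"c", "d"}, 0), ({"a", "d"}, 0),
]
min_support = 2       # absolute support threshold (assumed value)
max_pattern_size = 2  # cap pattern length for this sketch

def entropy(pos, neg):
    """Binary entropy of a (pos, neg) split; 0 if either side is empty."""
    total = pos + neg
    if total == 0 or pos == 0 or neg == 0:
        return 0.0
    p = pos / total
    return -(p * log2(p) + (1 - p) * log2(1 - p))

# 1. Enumerate frequent itemsets up to max_pattern_size (brute force here,
#    standing in for a scalable miner such as FP-growth or gSpan on graphs).
items = sorted(set().union(*(t for t, _ in data)))
frequent = []
for size in range(1, max_pattern_size + 1):
    for pattern in combinations(items, size):
        support = sum(1 for t, _ in data if set(pattern) <= t)
        if support >= min_support:
            frequent.append(frozenset(pattern))

# 2. Score each frequent pattern by the information gain of the binary
#    "pattern present / absent" split, and keep the top-k patterns.
n_pos = sum(y for _, y in data)
n_neg = len(data) - n_pos
base_h = entropy(n_pos, n_neg)

def info_gain(pattern):
    in_pos = sum(1 for t, y in data if pattern <= t and y == 1)
    in_neg = sum(1 for t, y in data if pattern <= t and y == 0)
    out_pos, out_neg = n_pos - in_pos, n_neg - in_neg
    n_in, n_out = in_pos + in_neg, out_pos + out_neg
    cond_h = (n_in * entropy(in_pos, in_neg)
              + n_out * entropy(out_pos, out_neg)) / len(data)
    return base_h - cond_h

top_k = 3
selected = sorted(frequent, key=info_gain, reverse=True)[:top_k]

# 3. Map each transaction to a binary feature vector over the selected
#    patterns; any standard classifier (SVM, decision tree, ...) can then
#    be trained on this representation.
X = [[1 if p <= t else 0 for p in selected] for t, _ in data]
y = [label for _, label in data]
print(selected, X, y)
```

The same pipeline carries over to structural data by replacing itemsets with frequent subsequences or subgraphs; the efficiency concern addressed by DDPMine and LEAP is precisely that step 1 becomes prohibitive when the full frequent pattern set must be materialized before selection.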
