Mining sophisticated patterns for classification and correlation analysis

Pattern mining has been an active research topic since it was first proposed for market basket analysis. Although it is one of the oldest topics in data mining, many challenges remain because datasets keep growing in size and their structure keeps growing in complexity. This dissertation discusses several pattern mining tasks, the challenges associated with them, and algorithm designs that overcome these challenges. Specifically, we design and implement techniques for (1) directly mining discriminative patterns from a numeric-valued feature set of k-embedded edge subtrees given labeled training data, (2) mining top correlated patterns from transactional databases with low minimum support, and (3) mining flipping correlation patterns from transactional databases given an item hierarchy. We evaluate our solutions through comprehensive experiments on large-scale synthetic and real-world datasets.
