Semantic annotation of frequent patterns

Using frequent patterns to analyze data has been one of the fundamental approaches in many data mining applications. Research in frequent pattern mining has so far mostly focused on developing efficient algorithms to discover various kinds of frequent patterns, but little attention has been paid to the important next step—interpreting the discovered frequent patterns. Although the compression and summarization of frequent patterns has been studied in some recent work, the proposed techniques there can only annotate a frequent pattern with nonsemantical information (e.g., support), which provides only limited help for a user to understand the patterns. In this article, we study the novel problem of generating semantic annotations for frequent patterns. The goal is to discover the hidden meanings of a frequent pattern by annotating it with in-depth, concise, and structured information. We propose a general approach to generate such an annotation for a frequent pattern by constructing its context model, selecting informative context indicators, and extracting representative transactions and semantically similar patterns. This general approach can well incorporate the user's prior knowledge, and has potentially many applications, such as generating a dictionary-like description for a pattern, finding synonym patterns, discovering semantic relations, and summarizing semantic classes of a set of frequent patterns. Experiments on different datasets show that our approach is effective in generating semantic pattern annotations.

[1]  G. Karypis,et al.  Frequent sub-structure-based approaches for classifying chemical compounds , 2005, Third IEEE International Conference on Data Mining.

[2]  Thorsten Joachims,et al.  Optimizing search engines using clickthrough data , 2002, KDD.

[3]  George Karypis,et al.  Frequent Substructure-Based Approaches for Classifying Chemical Compounds , 2005, IEEE Trans. Knowl. Data Eng..

[4]  Ke Wang,et al.  Clustering transactions using large items , 1999, CIKM '99.

[5]  Roberto J. Bayardo,et al.  Efficiently mining long patterns from databases , 1998, SIGMOD '98.

[6]  Xifeng Yan,et al.  CloSpan: Mining Closed Sequential Patterns in Large Datasets , 2003, SDM.

[7]  T. Landauer,et al.  Indexing by Latent Semantic Analysis , 1990 .

[8]  Michael J. Fischer,et al.  The String-to-String Correction Problem , 1974, JACM.

[9]  Hui Fang,et al.  A Study of Statistical Methods for Function Prediction of Protein Motifs , 2004, Applied bioinformatics.

[10]  Gregory N. Hullender,et al.  Learning to rank using gradient descent , 2005, ICML.

[11]  Jiawei Han,et al.  gSpan: graph-based substructure pattern mining , 2002, 2002 IEEE International Conference on Data Mining, 2002. Proceedings..

[12]  Gösta Grahne,et al.  Efficiently Using Prefix-trees in Mining Frequent Itemsets , 2003, FIMI.

[13]  Jon Kleinberg,et al.  Authoritative sources in a hyperlinked environment , 1999, SODA '98.

[14]  Tomasz Imielinski,et al.  Mining association rules between sets of items in large databases , 1993, SIGMOD Conference.

[15]  Rajeev Motwani,et al.  Beyond market baskets: generalizing association rules to correlations , 1997, SIGMOD '97.

[16]  Gerard Salton,et al.  A vector space model for automatic indexing , 1975, CACM.

[17]  Jiawei Han,et al.  Mining Compressed Frequent-Pattern Sets , 2005, VLDB.

[18]  Patrick Pantel,et al.  Discovering word senses from text , 2002, KDD.

[19]  Thomas M. Cover,et al.  Elements of Information Theory , 2005 .

[20]  Jiawei Han,et al.  Summarizing itemset patterns: a profile-based approach , 2005, KDD '05.

[21]  Richard A. Harshman,et al.  Indexing by Latent Semantic Analysis , 1990, J. Am. Soc. Inf. Sci..

[22]  Jiawei Han,et al.  ACM Transactions on Knowledge Discovery from Data: Introduction , 2007 .

[23]  Aristides Gionis,et al.  Assessing data mining results via swap randomization , 2007, TKDD.

[24]  Tao Tao,et al.  A formal study of information retrieval heuristics , 2004, SIGIR '04.

[25]  William DuMouchel,et al.  Empirical bayes screening for multi-item associations , 2001, KDD '01.

[26]  Geoffrey I. Webb Discovering significant patterns , 2008, Machine Learning.

[27]  Geoffrey I. Webb Discovering Significant Patterns , 2007, Machine Learning.

[28]  Kenneth Ward Church,et al.  Word Association Norms, Mutual Information, and Lexicography , 1989, ACL.

[29]  George Karypis,et al.  Frequent substructure-based approaches for classifying chemical compounds , 2003, IEEE Transactions on Knowledge and Data Engineering.

[30]  Nicolas Pasquier,et al.  Discovering Frequent Closed Itemsets for Association Rules , 1999, ICDT.

[31]  Ramakrishnan Srikant,et al.  Mining sequential patterns , 1995, Proceedings of the Eleventh International Conference on Data Engineering.

[32]  Aristides Gionis,et al.  Assessing data mining results via swap randomization , 2006, KDD '06.

[33]  Robert Krovetz,et al.  Viewing morphology as an inference process , 1993, Artif. Intell..

[34]  Jiawei Han,et al.  Mining top-k frequent closed patterns without minimum support , 2002, 2002 IEEE International Conference on Data Mining, 2002. Proceedings..

[35]  Jian Pei,et al.  Mining Frequent Patterns without Candidate Generation: A Frequent-Pattern Tree Approach , 2006, Sixth IEEE International Conference on Data Mining - Workshops (ICDMW'06).

[36]  Xin He,et al.  Automatically Generating Gene Summaries from Biomedical Literature , 2005, Pacific Symposium on Biocomputing.

[37]  Aristides Gionis,et al.  Approximating a collection of frequent sets , 2004, KDD.

[38]  Patrick Pantel,et al.  Induction of semantic classes from natural language text , 2001, KDD '01.

[39]  Ellen M. Voorhees,et al.  The TREC-5 Confusion Track: Comparing Retrieval Methods for Scanned Text , 2000, Information Retrieval.