Extracting redundancy-aware top-k patterns

Many applications need to extract a small set of frequent patterns that have not only high significance but also low redundancy. Significance is usually defined by the application context. Previous studies have concentrated either on computing the top-k significant patterns or on removing redundancy among patterns, treating the two tasks separately. There is limited work on finding top-k patterns that exhibit high significance and low redundancy simultaneously.

In this paper, we study the problem of extracting redundancy-aware top-k patterns from a large collection of frequent patterns. We first examine evaluation functions for measuring the combined significance of a pattern set and propose MMS (Maximal Marginal Significance) as the problem formulation. The problem is NP-hard. We further present a greedy algorithm that approximates the optimal solution with a performance bound of O(log k), subject to conditions on redundancy, where k is the number of reported patterns. The direct use of redundancy-aware top-k patterns is illustrated through two real applications: disk block prefetching and document theme extraction. Our method can also be applied to processing redundancy-aware top-k queries in traditional databases.
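The greedy selection described above can be illustrated with a minimal sketch. The abstract does not give the exact MMS objective, so the gain used here (a pattern's significance minus its largest redundancy with any already selected pattern) is an assumption made for illustration, as are the function name `greedy_top_k`, the callables `significance` and `redundancy`, and the toy scores.

```python
from typing import Callable, Hashable, List, Sequence


def greedy_top_k(
    patterns: Sequence[Hashable],
    significance: Callable[[Hashable], float],
    redundancy: Callable[[Hashable, Hashable], float],
    k: int,
) -> List[Hashable]:
    """Greedily pick up to k patterns by marginal gain.

    Assumed gain (not taken from the abstract): significance(p) minus the
    maximum redundancy between p and any pattern selected so far.
    """
    remaining = list(patterns)
    selected: List[Hashable] = []

    def gain(p: Hashable) -> float:
        if not selected:
            # Nothing selected yet, so there is no redundancy penalty.
            return significance(p)
        return significance(p) - max(redundancy(p, q) for q in selected)

    while remaining and len(selected) < k:
        best = max(remaining, key=gain)  # ties go to the earliest pattern
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data (hypothetical): "b" is almost as significant as "a",
# but highly redundant with it.
sig = {"a": 1.0, "b": 0.9, "c": 0.5}
red = {frozenset(("a", "b")): 0.8}

result = greedy_top_k(
    list(sig),
    significance=sig.__getitem__,
    redundancy=lambda p, q: red.get(frozenset((p, q)), 0.0),
    k=2,
)
print(result)  # ['a', 'c']
```

In this toy run, a plain top-2 ranking by significance would return "a" and "b"; the redundancy penalty causes "c" to be preferred over "b" once "a" has been selected.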
