Efficient mining of correlated sequential patterns based on null hypothesis

Frequent pattern mining has been widely studied in data mining for more than a decade. However, pattern mining on real datasets is complicated: a huge number of co-occurrence patterns are usually generated, the majority of which are either redundant or uninformative, so the true correlation relationships among data objects are buried under a large pile of useless information. To overcome this difficulty, mining correlations has been recognized as an important data mining task with many advantages over mining frequent patterns. In this paper, we formally define the task of mining frequent correlated sequential patterns from a sequential database. To this end, we re-examine various interestingness measures and select the appropriate one(s) that can disclose succinct relationships among sequential patterns. We then propose PSBSpan, an efficient algorithm built on the pattern-growth framework, which mines frequent correlated sequential patterns. Our experimental study on real datasets shows that our algorithm has outstanding performance in terms of both efficiency and effectiveness.
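
To make the pattern-growth idea concrete, the following is a minimal sketch of generic prefix-projected sequential pattern mining in the style of PrefixSpan. It is not PSBSpan: it counts support only and applies no correlation measure. The function name `prefixspan_sketch`, the restriction to sequences of single items, and the toy database are assumptions made purely for illustration.

```python
from collections import defaultdict


def prefixspan_sketch(sequences, min_support):
    """Return {pattern: support} for every frequent sequential pattern.

    Hypothetical helper for illustration only: each sequence is a list of
    single items, and support is the number of sequences containing the
    pattern as a (not necessarily contiguous) subsequence.
    """
    results = {}

    def grow(prefix, projected):
        # `projected` holds (sequence index, start offset) pairs: the
        # suffixes that remain after matching `prefix` as early as possible.
        counts = defaultdict(int)
        for sid, start in projected:
            # Count each item at most once per sequence.
            for item in set(sequences[sid][start:]):
                counts[item] += 1

        for item, support in sorted(counts.items()):
            if support < min_support:
                continue
            pattern = prefix + (item,)
            results[pattern] = support

            # Project every suffix past the first occurrence of `item`.
            next_projected = []
            for sid, start in projected:
                try:
                    pos = sequences[sid].index(item, start)
                except ValueError:
                    continue  # this sequence does not extend the pattern
                next_projected.append((sid, pos + 1))
            grow(pattern, next_projected)

    grow((), [(sid, 0) for sid in range(len(sequences))])
    return results


# Toy sequential database (assumed data, not from the paper).
toy_db = [["a", "b", "c"], ["a", "c"], ["b", "c"]]
patterns = prefixspan_sketch(toy_db, min_support=2)
for pattern, support in sorted(patterns.items()):
    print(pattern, support)
```

A correlated-pattern miner would additionally score each grown pattern with an interestingness measure and prune on that score; that step is deliberately left out here because the abstract does not state which measure is chosen.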
