Scalable Frequent Sequence Mining with Flexible Subsequence Constraints

We study scalable algorithms for frequent sequence mining under flexible subsequence constraints. Such constraints enable applications to specify concisely which patterns are of interest and which are not. We focus on the bulk synchronous parallel model with one round of communication; this model is suitable for platforms such as MapReduce or Spark. We derive a general framework for frequent sequence mining under this model and propose the D-SEQ and D-CAND algorithms within this framework. The algorithms differ in what data are communicated and how computation is split up among workers. To the best of our knowledge, D-SEQ and D-CAND are the first scalable algorithms for frequent sequence mining with flexible constraints. An experimental study on multiple real-world datasets suggests that our algorithms scale nearly linearly, outperform common baselines, and incur acceptable overhead for their generality compared with existing, less general mining algorithms.
