LASH: Large-Scale Sequence Mining with Hierarchies

We propose LASH, a scalable, distributed algorithm for mining sequential patterns in the presence of hierarchies. LASH takes as input a collection of sequences, each composed of items from some application-specific vocabulary. In contrast to traditional approaches to sequence mining, the items in the vocabulary are arranged in a hierarchy: both input sequences and sequential patterns may consist of items from different levels of the hierarchy. Such hierarchies occur naturally in a number of applications, including the mining of natural-language text, customer transactions, error logs, and event sequences. LASH is the first parallel algorithm for mining frequent sequences with hierarchies; it is designed to scale to very large datasets. At its heart, LASH partitions the data using a novel, hierarchy-aware variant of item-based partitioning and subsequently mines each partition independently and in parallel using a customized mining algorithm called the pivot sequence miner. LASH is amenable to a MapReduce implementation; we propose effective and efficient algorithms for both the construction and the actual mining of partitions. Our experimental study on large real-world datasets suggests good scalability and run-time efficiency.
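To make the notion of patterns over a hierarchy concrete, the following minimal Python sketch shows how a pattern built from higher-level items can match input sequences built from lower-level items, and how a naive item-based partitioning might replicate each sequence to the partitions of its items and their ancestors. It is an illustration under stated assumptions, not the LASH algorithm itself: the toy hierarchy `PARENT`, the example database `DB`, and the function names are all hypothetical, arbitrary gaps between matched items are allowed, and the partition construction and pivot sequence miner used by LASH are not reproduced here.

```python
from collections import defaultdict

# Hypothetical toy hierarchy (child -> parent); items without an entry are roots.
PARENT = {"lives": "VERB", "works": "VERB", "Berlin": "CITY", "Paris": "CITY"}


def ancestors_or_self(item):
    """Return the item followed by all of its ancestors in the hierarchy."""
    chain = [item]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def matches(pattern, sequence):
    """Check whether pattern occurs in sequence as a subsequence (gaps allowed).

    A pattern item matches an input item if it equals that item or is one
    of its ancestors. Greedy left-to-right matching suffices for this test.
    """
    pos = 0
    for item in sequence:
        if pos < len(pattern) and pattern[pos] in ancestors_or_self(item):
            pos += 1
    return pos == len(pattern)


def support(pattern, db):
    """Number of input sequences in which the pattern occurs."""
    return sum(matches(pattern, seq) for seq in db)


def partition_by_item(db):
    """Naive item-based partitioning (assumption, not the scheme used by LASH).

    Each sequence is copied to the partition of every item it contains and of
    every ancestor of such an item, so that each partition can be mined
    independently for patterns that involve its key item.
    """
    partitions = defaultdict(list)
    for seq in db:
        keys = {a for item in seq for a in ancestors_or_self(item)}
        for key in keys:
            partitions[key].append(seq)
    return partitions


# Hypothetical example database of tokenized sentences.
DB = [
    ["she", "lives", "in", "Berlin"],
    ["he", "works", "in", "Paris"],
    ["he", "lives", "here"],
]
```

With this toy data, `support(["VERB", "in", "CITY"], DB)` is 2 even though no input sequence contains the items `VERB` or `CITY` literally, whereas the more specific pattern `["lives", "in", "CITY"]` has support 1. The naive partitioning replicates every sequence many times; reducing that redundancy is presumably where a hierarchy-aware partitioning scheme earns its keep.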
