Paper information - LASH: Large-Scale Sequence Mining with Hierarchies

LASH: Large-Scale Sequence Mining with Hierarchies

We propose LASH, a scalable, distributed algorithm for mining sequential patterns in the presence of hierarchies. LASH takes as input a collection of sequences, each composed of items from some application-specific vocabulary. In contrast to traditional approaches to sequence mining, the items in the vocabulary are arranged in a hierarchy: both input sequences and sequential patterns may consist of items from different levels of the hierarchy. Such hierarchies naturally occur in a number of applications including mining natural-language text, customer transactions, error logs, or event sequences. LASH is the first parallel algorithm for mining frequent sequences with hierarchies; it is designed to scale to very large datasets. At its heart, LASH partitions the data using a novel, hierarchy-aware variant of item-based partitioning and subsequently mines each partition independently and in parallel using a customized mining algorithm called pivot sequence miner. LASH is amenable to a MapReduce implementation; we propose effective and efficient algorithms for both the construction and the actual mining of partitions. Our experimental study on large real-world datasets suggests good scalability and run-time efficiency.
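The idea of item-based partitioning with a hierarchy can be illustrated with a small single-process sketch. This is not the authors' implementation: the toy hierarchy `PARENT`, the function names, the restriction to contiguous patterns, and the use of plain string order to pick each pattern's pivot item are all assumptions made for illustration. The sketch sends every sequence to the partition of each item it contains (directly or through an ancestor), then counts in each partition only the patterns whose largest item is that partition's pivot, so every pattern is counted in exactly one place.

```python
from collections import Counter, defaultdict
from itertools import product

# Hypothetical toy hierarchy: child -> parent (roots have no entry).
PARENT = {"a1": "A", "a2": "A", "b1": "B"}


def generalizations(item):
    """Return the item followed by all of its ancestors."""
    chain = [item]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def generalized_ngrams(seq, max_len):
    """Distinct contiguous patterns of length 2..max_len supported by seq,
    where each position may be replaced by any of its ancestors."""
    found = set()
    for start in range(len(seq)):
        for end in range(start + 2, min(start + max_len, len(seq)) + 1):
            options = [generalizations(x) for x in seq[start:end]]
            found.update(product(*options))
    return found


def partition(sequences):
    """'Map' step: copy each sequence into the partition of every item
    it contains directly or through an ancestor."""
    parts = defaultdict(list)
    for seq in sequences:
        pivots = {g for x in seq for g in generalizations(x)}
        for pivot in pivots:
            parts[pivot].append(seq)
    return parts


def mine_partition(pivot, seqs, min_support, max_len):
    """'Reduce' step: count patterns whose largest item (string order,
    an assumed stand-in for a real item order) equals the pivot."""
    counts = Counter()
    for seq in seqs:
        for pattern in generalized_ngrams(seq, max_len):
            if max(pattern) == pivot:
                counts[pattern] += 1  # one count per supporting sequence
    return {p: c for p, c in counts.items() if c >= min_support}


def mine(sequences, min_support=2, max_len=3):
    """Mine every partition independently and merge the results."""
    result = {}
    for pivot, seqs in partition(sequences).items():
        result.update(mine_partition(pivot, seqs, min_support, max_len))
    return result


if __name__ == "__main__":
    data = [["a1", "b1"], ["a2", "b1"], ["a1", "b1", "a2"]]
    for pattern, support in sorted(mine(data, min_support=3).items()):
        print(pattern, support)
```

On this toy input with `min_support=3`, only the generalized patterns `("A", "B")` and `("A", "b1")` are frequent; neither `("a1", "b1")` nor `("a2", "b1")` reaches the threshold on its own, which is the kind of pattern a hierarchy makes visible. Copying whole sequences into each partition is the naive choice here; the abstract indicates that LASH uses more careful algorithms for constructing and mining partitions.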

Rainer Gemulla | Kaustubh Beedkar
