Computing n-gram statistics in MapReduce

Statistics about n-grams (i.e., sequences of contiguous words or other tokens in text documents or other string data) are an important building block in information retrieval and natural language processing. In this work, we study how n-gram statistics, optionally restricted by a maximum n-gram length and a minimum collection frequency, can be computed efficiently by harnessing MapReduce for distributed data processing. We describe different algorithms, ranging from an extension of word counting, via methods based on the Apriori principle, to a novel method, Suffix-σ, that relies on sorting and aggregating suffixes. We examine possible extensions of our method to support the notions of maximality/closedness and to perform aggregations beyond occurrence counting. Assuming Hadoop as a concrete MapReduce implementation, we provide insights into an efficient implementation of the methods. Extensive experiments on The New York Times Annotated Corpus and ClueWeb09 expose the relative benefits and trade-offs of the methods.
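
To make the simplest of these algorithms concrete, the following is a minimal sketch of the extension of word counting: a map step emits every n-gram up to a maximum length, and a reduce step sums the occurrences and keeps only n-grams that reach a minimum collection frequency. It is an in-memory Python simulation, not the Hadoop implementation studied in this work; the function names `map_ngrams`, `reduce_counts`, and `ngram_statistics`, the parameters `max_len` and `min_freq`, and whitespace tokenization are all assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, List, Tuple

NGram = Tuple[str, ...]


def map_ngrams(tokens: List[str], max_len: int) -> Iterator[Tuple[NGram, int]]:
    """Map step (hypothetical): emit (n-gram, 1) for each n-gram of length 1..max_len."""
    for start in range(len(tokens)):
        for n in range(1, max_len + 1):
            if start + n > len(tokens):
                break  # the n-gram would run past the end of the document
            yield tuple(tokens[start:start + n]), 1


def reduce_counts(pairs: Iterable[Tuple[NGram, int]], min_freq: int) -> Dict[NGram, int]:
    """Reduce step (hypothetical): sum counts per n-gram, keep those with count >= min_freq."""
    counts: Counter = Counter()
    for ngram, count in pairs:
        counts[ngram] += count
    return {ngram: c for ngram, c in counts.items() if c >= min_freq}


def ngram_statistics(documents: Iterable[str], max_len: int, min_freq: int) -> Dict[NGram, int]:
    """Chain the two steps over a collection; tokens are split on whitespace (an assumption)."""
    pairs = (pair for doc in documents for pair in map_ngrams(doc.split(), max_len))
    return reduce_counts(pairs, min_freq)


docs = ["a b a b", "b a b"]
stats = ngram_statistics(docs, max_len=2, min_freq=3)
print(stats)
```

In this toy collection the bigram `("b", "a")` occurs only twice and is therefore filtered out, while `("a",)`, `("b",)`, and `("a", "b")` are retained. A sketch like this emits every n-gram occurrence individually, which suggests why approaches that transfer less intermediate data can be attractive at scale.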
