Sketching Techniques for Large Scale NLP

In this paper, we address the challenges posed by large amounts of text data by exploiting the power of hashing in the context of streaming data. We explore sketch techniques, especially the Count-Min Sketch, which approximates the frequency of a word pair in the corpus without explicitly storing the word pairs themselves. We use the idea of a conservative update with the Count-Min Sketch to reduce the average relative error of its approximate counts by a factor of two. We show that it is possible to store all word and word-pair counts computed from 37 GB of web data in just 2 billion counters (8 GB of RAM). The number of counters is up to 30 times smaller than the stream size, which is a substantial saving in memory. In Semantic Orientation experiments, the PMI scores computed from 2 billion counters are as effective as exact PMI scores.
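
To make the idea concrete, the following is a minimal Python sketch of a Count-Min Sketch with a conservative update, together with a PMI estimate computed from approximate counts. It is an illustration under stated assumptions, not the implementation used in the paper: the class name `CountMinSketch`, the helper `approx_pmi`, the salted BLAKE2 hashing, the toy table sizes, and the use of a single normaliser `total` for both word and word-pair probabilities are all hypothetical choices made here for brevity. The figures in the abstract (2 billion counters in 8 GB) are consistent with roughly 4 bytes per counter; the Python lists below are far less compact.

```python
import hashlib
import math


class CountMinSketch:
    """Toy Count-Min Sketch; names and hashing scheme are illustrative."""

    def __init__(self, width, depth, conservative=True):
        self.width = width
        self.depth = depth
        self.conservative = conservative
        self.table = [[0] * width for _ in range(depth)]

    def _columns(self, item):
        # One column per row. Salting BLAKE2 with the row index stands in for
        # the pairwise-independent hash family assumed by the analysis.
        cols = []
        for row in range(self.depth):
            digest = hashlib.blake2b(
                item.encode("utf-8"),
                digest_size=8,
                salt=row.to_bytes(8, "little"),
            ).digest()
            cols.append(int.from_bytes(digest, "little") % self.width)
        return cols

    def update(self, item, count=1):
        cols = self._columns(item)
        if self.conservative:
            # Conservative update: raise a counter only as far as needed so
            # that the item's estimate becomes (old estimate + count).
            target = min(self.table[r][c] for r, c in enumerate(cols)) + count
            for r, c in enumerate(cols):
                if self.table[r][c] < target:
                    self.table[r][c] = target
        else:
            # Plain Count-Min update: increment every hashed counter.
            for r, c in enumerate(cols):
                self.table[r][c] += count

    def query(self, item):
        # The minimum over rows is the least-contaminated counter; it never
        # falls below the true count.
        cols = self._columns(item)
        return min(self.table[r][c] for r, c in enumerate(cols))


def approx_pmi(sketch, w1, w2, total):
    """PMI from sketch counts, log2(total * c(w1,w2) / (c(w1) * c(w2))).

    Using one normaliser `total` for all probabilities is a simplification.
    """
    joint = sketch.query(w1 + " " + w2)
    c1, c2 = sketch.query(w1), sketch.query(w2)
    if joint == 0 or c1 == 0 or c2 == 0:
        return float("-inf")
    return math.log2(total * joint / (c1 * c2))


if __name__ == "__main__":
    tokens = "good food good service bad food good food".split()
    sketch = CountMinSketch(width=1024, depth=3)
    for tok in tokens:
        sketch.update(tok)
    for left, right in zip(tokens, tokens[1:]):
        sketch.update(left + " " + right)  # word pairs share the same table
    print(sketch.query("good food"), approx_pmi(sketch, "good", "food", len(tokens)))
```

Words and word pairs are stored in the same table here, keyed by their string form, so no pair is ever stored explicitly. The conservative variant leaves every counter no larger than the plain update would, which is why its estimates can only be closer to the true counts while still never underestimating them.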
