Approximate Scalable Bounded Space Sketch for Large Data NLP

We exploit sketch techniques, especially the Count-Min sketch, a memory- and time-efficient framework that approximates the frequency of a word pair in a corpus without explicitly storing the word pair itself. These methods use hashing to handle massive amounts of streaming text. We apply the Count-Min sketch to approximate word pair counts and demonstrate its effectiveness on three important NLP tasks. Our experiments show that on all three tasks, performance is comparable both to the setting with exact word pair counts and to state-of-the-art systems. Our method scales to 49 GB of unzipped web data using a bounded space of 2 billion counters (8 GB of memory).
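
To make the idea concrete, the following is a minimal sketch of a textbook Count-Min sketch applied to word pair counts. It is not the authors' implementation: the class name `CountMinSketch`, the helper `pair_key`, the choice of salted BLAKE2b as the hash family, and the toy table size are all assumptions made for illustration.

```python
import hashlib


class CountMinSketch:
    """Textbook Count-Min sketch: a depth x width table of counters."""

    def __init__(self, width, depth):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, key, row):
        # Assumption: one salted BLAKE2b hash per row stands in for the
        # pairwise-independent hash family used in the formal analysis.
        digest = hashlib.blake2b(
            key.encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def update(self, key, count=1):
        # Increment one counter per row; the key itself is never stored.
        for row in range(self.depth):
            self.table[row][self._index(key, row)] += count

    def query(self, key):
        # Collisions can only inflate a counter, so the minimum over rows
        # is the tightest estimate and never undercounts.
        return min(
            self.table[row][self._index(key, row)]
            for row in range(self.depth)
        )


def pair_key(left, right):
    # Hypothetical encoding of a word pair as a single hashable string.
    return left + "\t" + right


# Stream adjacent word pairs from a toy corpus into the sketch.
tokens = "new york city is in new york state".split()
sketch = CountMinSketch(width=1024, depth=4)
for left, right in zip(tokens, tokens[1:]):
    sketch.update(pair_key(left, right))

estimate = sketch.query(pair_key("new", "york"))
```

With non-negative updates, `query` returns a value at least as large as the true count, and the overestimate shrinks as `width` grows. Memory is fixed at `width * depth` counters regardless of how many distinct pairs appear in the stream. The figures in the abstract (2 billion counters in 8 GB) work out to about 4 bytes per counter.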
