Sketch Algorithms for Estimating Point Queries in NLP

Many NLP tasks rely on accurate statistics from large corpora. Tracking complete statistics is memory-intensive, so recent work has proposed compact approximate "sketches" of frequency distributions. We describe 10 sketch methods, including existing and novel variants, and compare the errors (over-estimation and under-estimation) that they make. We evaluate several sketches on three important NLP problems. Our experiments show that one sketch performs best on all three tasks.
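To make the idea of a point query on a sketch concrete, the following is a minimal illustration using a plain Count-Min sketch. The abstract does not say which ten methods are studied, so the choice of Count-Min here is an assumption, as are the class name `CountMinSketch`, the `width` and `depth` parameters, and the SHA-1-based row hashing. Count-Min only ever over-estimates, so this sketch shows one side of the error behaviour the abstract mentions and should not be read as the paper's method.

```python
import hashlib


class CountMinSketch:
    """Illustrative Count-Min sketch (hypothetical names, not the paper's code)."""

    def __init__(self, width=1024, depth=4):
        self.width = width    # counters per row
        self.depth = depth    # number of hash rows
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # Derive one hash per row by prefixing the row number to the item.
        digest = hashlib.sha1(f"{row}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "little") % self.width

    def update(self, item, count=1):
        # Add the count to one counter in every row.
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def query(self, item):
        # Point query: the minimum over rows limits the effect of collisions,
        # so the estimate is never below the true count.
        return min(self.table[row][self._index(item, row)]
                   for row in range(self.depth))


tokens = "the cat sat on the mat near the door".split()
sketch = CountMinSketch(width=64, depth=3)
for token in tokens:
    sketch.update(token)

estimate = sketch.query("the")  # true count is 3; the estimate is at least 3
```

Memory is fixed at `width * depth` counters regardless of vocabulary size, which is the trade-off the abstract describes: bounded space in exchange for approximate counts.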
