Lossy Conservative Update (LCU) Sketch: Succinct Approximate Count Storage

In this paper, we propose a variant of the Count-Min sketch with conservative update that further reduces its over-estimation error. Inspired by lossy counting, we divide a stream of items into multiple windows and decrement certain counts in the sketch at window boundaries. We refer to this approach as lossy conservative update (LCU). The reduction in over-estimation error comes at the cost of introducing under-estimation error. However, our intrinsic evaluations show that the reduction in over-estimation is much greater than the under-estimation error that LCU introduces. We apply the LCU framework to scale distributional similarity computations to web-scale corpora, and show that on this task it is more efficient in memory and time, and more robust, than the Count-Min sketch with conservative update (CU).
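The idea described above can be illustrated with a minimal Python sketch. It is not the paper's implementation. The class name `LossyCUSketch`, its parameters, and the BLAKE2-based hashing are hypothetical choices made for illustration. The decrement rule is also an assumption borrowed from lossy counting (subtract 1 from every positive cell whose value does not exceed the current window index), since the abstract only says that "certain counts" are decremented.

```python
import hashlib


class LossyCUSketch:
    """Illustrative Count-Min sketch with conservative update and a lossy decay step."""

    def __init__(self, width, depth, window_size):
        self.width = width              # counters per row
        self.depth = depth              # number of rows (hash functions)
        self.window_size = window_size  # stream items per window
        self.table = [[0] * width for _ in range(depth)]
        self.seen = 0                   # stream items processed so far
        self.window = 1                 # index of the current window

    def _cells(self, item):
        # One salted hash per row; the row index serves as the salt.
        for row in range(self.depth):
            digest = hashlib.blake2b(
                item.encode("utf-8"),
                digest_size=8,
                salt=row.to_bytes(8, "little"),
            ).digest()
            yield row, int.from_bytes(digest, "little") % self.width

    def query(self, item):
        # Count-Min estimate: the minimum over the item's cells.
        return min(self.table[r][c] for r, c in self._cells(item))

    def update(self, item, count=1):
        cells = list(self._cells(item))
        # Conservative update: raise a cell only if it is below the new estimate.
        target = min(self.table[r][c] for r, c in cells) + count
        for r, c in cells:
            if self.table[r][c] < target:
                self.table[r][c] = target
        self.seen += 1
        if self.seen % self.window_size == 0:
            self._decay()

    def _decay(self):
        # Assumed rule: decrement small positive cells at each window boundary.
        for row in self.table:
            for c, value in enumerate(row):
                if 0 < value <= self.window:
                    row[c] = value - 1
        self.window += 1


if __name__ == "__main__":
    sketch = LossyCUSketch(width=1024, depth=3, window_size=1000)
    for token in ["cat", "dog", "cat", "cat"]:
        sketch.update(token)
    print(sketch.query("cat"), sketch.query("dog"))
```

Under this assumed rule, cells holding small counts at a window boundary lose one count, which trims collision noise but can also under-estimate genuinely rare items. That is the trade-off between over-estimation and under-estimation error that the abstract describes.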
