Lossy Conservative Update (LCU) Sketch: Succinct Approximate Count Storage

In this paper, we propose a variant of the conservative-update Count-Min sketch that further reduces its over-estimation error. Inspired by ideas from lossy counting, we divide a stream of items into multiple windows and decrement certain counts in the sketch at window boundaries. We refer to this approach as lossy conservative update (LCU). The reduction in over-estimation error comes at the cost of introducing under-estimation error. However, our intrinsic evaluations show that the reduction in over-estimation is much greater than the under-estimation error that LCU introduces. We apply the LCU framework to scale distributional similarity computations to web-scale corpora. On this task, we show that the technique is more efficient in both memory and time, and more robust, than conservative update with the Count-Min sketch (CU).
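The mechanism described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which counts are decremented, so the rule used here (reduce by one every cell whose value is between 1 and a `threshold`) is an assumption, as are the class name `LossyCUSketch`, its parameters, and the BLAKE2-based hashing.

```python
import hashlib


class LossyCUSketch:
    """Illustrative Count-Min sketch with conservative update and lossy decay."""

    def __init__(self, width, depth, window_size, threshold=1):
        self.width = width              # cells per row
        self.depth = depth              # number of hash rows
        self.window_size = window_size  # items per window (assumed fixed)
        self.threshold = threshold      # assumed decay rule: cells <= threshold shrink
        self.table = [[0] * width for _ in range(depth)]
        self.seen = 0

    def _index(self, item, row):
        # One hash function per row, derived by salting BLAKE2b with the row id.
        digest = hashlib.blake2b(
            item.encode("utf-8"), digest_size=8, salt=row.to_bytes(8, "little")
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def query(self, item):
        # Count-Min estimate: the minimum over the item's cells.
        return min(self.table[r][self._index(item, r)] for r in range(self.depth))

    def update(self, item, count=1):
        # Conservative update: raise each cell only as far as estimate + count.
        target = self.query(item) + count
        for r in range(self.depth):
            i = self._index(item, r)
            self.table[r][i] = max(self.table[r][i], target)
        self.seen += 1
        if self.seen % self.window_size == 0:
            self._decay()  # window boundary reached

    def _decay(self):
        # Lossy step (assumed rule): decrement small non-zero cells by one.
        for row in self.table:
            for i, value in enumerate(row):
                if 0 < value <= self.threshold:
                    row[i] = value - 1
```

With a window larger than the stream, the sketch behaves like plain conservative update. With a short window, an item seen only once can drop back to zero, which is the under-estimation the abstract trades for lower over-estimation.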

Hal Daumé | Amit Goyal

[1] Florin Rusu, et al. Statistical analysis of sketch estimators, 2007, SIGMOD '07.

[2] John B. Goodenough, et al. Contextual correlates of synonymy, 1965, CACM.

[3] J. R. Firth, et al. A Synopsis of Linguistic Theory, 1930-1955, 1957.

[4] Ashwin Lall, et al. Streaming Pointwise Mutual Information, 2009, NIPS.

[5] Cristian Estan, et al. New directions in traffic measurement and accounting, 2001, IMW '01.

[6] Graham Cormode, et al. Count-Min Sketch, 2016, Encyclopedia of Algorithms.

[7] Eneko Agirre, et al. A Study on Similarity and Relatedness Using Distributional and WordNet-based Approaches, 2009, NAACL.

[8] Kenneth Ward Church, et al. Word Association Norms, Mutual Information, and Lexicography, 1989, ACL.

[9] Peter D. Turney. A Uniform Approach to Analogies, Synonyms, Antonyms, and Associations, 2008, COLING.

[10] Ehud Rivlin, et al. Placing search in context: the concept revisited, 2002, TOIS.

[11] Miles Osborne, et al. Stream-based Randomised Language Models for SMT, 2009, EMNLP.

[12] Suresh Venkatasubramanian, et al. Sketching Techniques for Large Scale NLP, 2010, WAC@NAACL-HLT.

[13] Patrick Pantel, et al. Randomized Algorithms and NLP: Using Locality Sensitive Hash Functions for High Speed Noun Clustering, 2005, ACL.

[14] Thorsten Brants, et al. Large Language Models in Machine Translation, 2007, EMNLP.

[15] Suresh Venkatasubramanian, et al. Streaming for large scale NLP: Language Modeling, 2009, NAACL.

[16] Rajeev Motwani, et al. Approximate Frequency Counts over Data Streams, 2012, VLDB.

[17] Moses Charikar, et al. Finding frequent items in data streams, 2004, Theor. Comput. Sci.

[18] Eric Crestan, et al. Web-Scale Distributional Similarity and Entity Set Expansion, 2009, EMNLP.

[19] Graham Cormode, et al. An improved data stream summary: the count-min sketch and its applications, 2004, J. Algorithms.

[20] Zellig S. Harris, et al. Distributional Structure, 1954.

[21] G. Miller, et al. Contextual correlates of semantic similarity, 1991.

[22] Ashwin Lall, et al. Efficient Online Locality Sensitive Hashing via Reservoir Counting, 2011, ACL.

[23] Daumé, et al. Sketch Techniques for Scaling Distributional Similarity to the Web, 2010.

[24] Ashwin Lall, et al. Online Generation of Locality Sensitive Hash Signatures, 2010, ACL.