Paper information - Recursive Hashing and One-Pass, One-Hash n-Gram Count Estimation

Recursive Hashing and One-Pass, One-Hash n-Gram Count Estimation

Many applications use sequences of n consecutive symbols (n-grams). We review n-gram hashing and prove that recursive hash families are pairwise independent at best. We prove that hashing by irreducible polynomials is pairwise independent whereas hashing by cyclic polynomials is quasi-pairwise independent: we make it pairwise independent by discarding n − 1 bits. One application of hashing is to estimate the number of distinct n-grams, a view-size estimation problem. While view sizes can be estimated by sampling under statistical assumptions, we desire a statistically unassuming algorithm with universally valid accuracy bounds. Most related work has focused on repeatedly hashing the data, which is prohibitive for large data sources. We prove that a one-pass, one-hash algorithm is sufficient for accurate estimates if the hashing is sufficiently independent. For example, we can improve the theoretical bounds on estimation accuracy by a factor of 2 by replacing pairwise independent hashing with 4-wise independent hashing. We show that recursive random hashing is sufficiently independent in practice. Perhaps surprisingly, our experiments showed that hashing by cyclic polynomials, which is only quasi-pairwise independent, sometimes outperformed 10-wise independent hashing while being twice as fast. For comparison, we measured the time to obtain exact n-gram counts using suffix arrays and show that, while we used hardly any storage, we were an order of magnitude faster. The experiments used a large collection of English text from Project Gutenberg as well as synthetic data.
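
To make the idea of recursive n-gram hashing concrete, here is a minimal Python sketch of hashing by cyclic polynomials, in which the hash of each n-gram is obtained from the hash of the previous one in constant time. It is an illustration under stated assumptions, not the authors' implementation: the 32-bit width `L`, the helper names (`rotl`, `make_symbol_hash`, `direct_hash`, `rolling_hashes`, `drop_bits`), and the choice to discard the low-order bits are all hypothetical. The abstract says only that n − 1 bits are discarded, not which ones.

```python
import random

L = 32                      # hash width in bits (assumed for this sketch)
MASK = (1 << L) - 1


def rotl(x, r):
    """Rotate the L-bit value x left by r positions."""
    r %= L
    return ((x << r) | (x >> (L - r))) & MASK


def make_symbol_hash(seed=0):
    """Return h1, a function assigning each symbol a random L-bit value."""
    rng = random.Random(seed)
    table = {}

    def h1(symbol):
        if symbol not in table:
            table[symbol] = rng.getrandbits(L)
        return table[symbol]

    return h1


def direct_hash(ngram, h1):
    """Non-recursive hash: XOR of the symbol hashes, each rotated by its
    distance from the end of the n-gram."""
    h = 0
    for s in ngram:
        h = rotl(h, 1) ^ h1(s)
    return h


def rolling_hashes(seq, n, h1):
    """Yield the hash of every n-gram of seq, updating recursively."""
    if len(seq) < n:
        return
    h = direct_hash(seq[:n], h1)
    yield h
    for i in range(n, len(seq)):
        # Rotate the old hash, cancel the outgoing symbol, add the new one.
        h = rotl(h, 1) ^ rotl(h1(seq[i - n]), n) ^ h1(seq[i])
        yield h


def drop_bits(h, n):
    """Discard n - 1 bits (here the low-order ones, an assumption)."""
    return h >> (n - 1)
```

A usage example would be `list(rolling_hashes("the quick brown fox", 3, make_symbol_hash()))`, which produces one hash per character 3-gram. Each value matches what `direct_hash` returns for the same 3-gram, because rotating the previous hash and XORing out the rotated outgoing symbol leaves exactly the terms of the next n-gram.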

Owen Kaser | Daniel Lemire
