Recursive Hashing and One-Pass, One-Hash n-Gram Count Estimation

Many applications use sequences of n consecutive symbols (n-grams). We review n-gram hashing and prove that recursive hash families are pairwise independent at best. We prove that hashing by irreducible polynomials is pairwise independent, whereas hashing by cyclic polynomials is only quasi-pairwise independent; we make it pairwise independent by discarding n − 1 bits. One application of hashing is to estimate the number of distinct n-grams, a view-size estimation problem. While view sizes can be estimated by sampling under statistical assumptions, we desire a statistically unassuming algorithm with universally valid accuracy bounds. Most related work has focused on repeatedly hashing the data, which is prohibitive for large data sources. We prove that a one-pass, one-hash algorithm is sufficient for accurate estimates if the hashing is sufficiently independent. For example, we can improve the theoretical bounds on estimation accuracy by a factor of 2 by replacing pairwise independent hashing with 4-wise independent hashing. We show that recursive random hashing is sufficiently independent in practice. Perhaps surprisingly, our experiments showed that hashing by cyclic polynomials, which is only quasi-pairwise independent, sometimes outperformed 10-wise independent hashing while being twice as fast. For comparison, we measured the time to obtain exact n-gram counts using suffix arrays and show that our approach, while using hardly any storage, was an order of magnitude faster. The experiments used a large collection of English text from Project Gutenberg as well as synthetic data.
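To make the idea of recursive n-gram hashing concrete, the following is a minimal sketch of hashing by cyclic polynomials, not the authors' implementation. It assumes a hash width of L = 32 bits and a random per-symbol lookup table; the names `rotl`, `make_symbol_table`, `direct_hash`, `rolling_hashes`, and `drop_bits` are hypothetical and chosen for illustration. The abstract does not say which n − 1 bits are discarded, so dropping the low-order bits in `drop_bits` is an assumption.

```python
import random

L = 32                  # assumed hash width in bits
MASK = (1 << L) - 1


def rotl(x, r):
    """Rotate the L-bit value x left by r positions."""
    r %= L
    return ((x << r) | (x >> (L - r))) & MASK


def make_symbol_table(alphabet, seed=0):
    """Assign each symbol a random L-bit value (the per-symbol hash)."""
    rng = random.Random(seed)
    return {c: rng.getrandbits(L) for c in alphabet}


def direct_hash(gram, h1):
    """Hash one n-gram from scratch: rotate, then XOR in the next symbol."""
    h = 0
    for c in gram:
        h = rotl(h, 1) ^ h1[c]
    return h


def rolling_hashes(text, n, h1):
    """Yield the hash of every n-gram of text with O(1) work per step."""
    if len(text) < n:
        return
    h = direct_hash(text[:n], h1)
    yield h
    for i in range(n, len(text)):
        outgoing = rotl(h1[text[i - n]], n)   # contribution of the symbol leaving the window
        h = rotl(h, 1) ^ outgoing ^ h1[text[i]]
        yield h


def drop_bits(h, n):
    """Discard n - 1 bits (here the low-order ones, an assumption)."""
    return h >> (n - 1)
```

Each update touches only the symbol entering and the symbol leaving the window, which is what makes a single pass over a large text inexpensive. For example, `set(drop_bits(h, 5) for h in rolling_hashes(text, 5, h1))` collects the reduced hash values of all 5-grams of `text` in one pass.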
