An Effective Method of Arbitrary Length N-gram Statistics for Chinese Text

N-gram frequency is an important indicator in corpus-based natural language processing. As corpora grow larger, time and space consumption become serious issues. In this paper, we propose a two-stage inverted index based on bigrams. The index is built from two levels of variable-length vectors, and the two characters of a bigram are hashed to different levels. To keep indexing efficient, some null units are reserved in the vectors. In this way, unigrams and bigrams can be indexed directly. With the help of the index, the frequency of an arbitrary-length n-gram can be computed efficiently by intersecting posting lists. The two-stage index has several desirable properties: (1) it can be constructed efficiently; (2) space consumption is significantly reduced without compression; (3) for n=1 and n=2, the time complexity of frequency counting drops significantly, and for n>2 the index is more efficient than a suffix array in most cases. The method is suitable for Chinese and other Asian languages.
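To make the idea concrete, here is a minimal Python sketch of the approach, not the paper's implementation: the paper hashes the two characters of each bigram into two levels of variable-length vectors with reserved null units, while this sketch substitutes nested dictionaries for those vectors. The class name TwoStageBigramIndex and the sentinel padding are illustrative assumptions.

```python
from collections import defaultdict

class TwoStageBigramIndex:
    """Minimal sketch of a two-level bigram inverted index.

    Level 1 is keyed by the first character of a bigram, level 2 by the
    second character; each level-2 slot holds the start positions of that
    bigram in the text. (The paper uses hashed variable-length vectors
    with reserved null units; plain dicts stand in here.)
    """

    def __init__(self, text: str):
        # Append a sentinel so the final character also starts a bigram,
        # which keeps unigram counts exact (an assumption of this sketch).
        padded = text + "\0"
        self.index = defaultdict(lambda: defaultdict(list))
        for i in range(len(padded) - 1):
            self.index[padded[i]][padded[i + 1]].append(i)

    def freq(self, ngram: str) -> int:
        """Frequency of an arbitrary-length n-gram.

        n = 1 and n = 2 are direct lookups; for n > 2 the posting lists
        of the n - 1 overlapping bigrams are intersected after shifting
        each list back to the n-gram's start position.
        """
        n = len(ngram)
        if n == 0:
            return 0
        if n == 1:
            # Unigram: sum the level-2 posting lists under this character.
            return sum(len(p) for p in self.index[ngram].values())
        # Positions of the query's first bigram.
        positions = set(self.index[ngram[0]][ngram[1]])
        # Intersect with every later bigram, shifted by its offset k.
        for k in range(1, n - 1):
            shifted = {p - k for p in self.index[ngram[k]][ngram[k + 1]]}
            positions &= shifted
        return len(positions)


if __name__ == "__main__":
    idx = TwoStageBigramIndex("abcabcab")
    print(idx.freq("a"))    # 3 (unigram: direct level-1 lookup)
    print(idx.freq("ab"))   # 3 (bigram: direct level-2 lookup)
    print(idx.freq("abc"))  # 2 (trigram: intersect "ab" and "bc" lists)
```

Note that n=1 and n=2 never touch the intersection path, which mirrors the claimed complexity drop for those cases; only n>2 pays for intersecting the overlapping bigrams' posting lists.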
