Paper information - Efficient In-memory Data Structures for n-grams Indexing

Efficient In-memory Data Structures for n-grams Indexing

Indexing n-gram phrases from text has many practical applications, such as plagiarism detection, comparison of DNA sequences, and spam detection. In this paper we describe several data structures, such as the hash table and the B+ tree, that can store n-grams for searching. We perform tests that show their advantages and disadvantages. One data structure neglected for this purpose, the ternary search tree, is described in depth, and two performance improvements are proposed.
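To make the ternary search tree mentioned in the abstract concrete, the following is a minimal sketch of how word n-grams might be stored and counted in one. It is an illustration under stated assumptions, not the authors' implementation, and it does not include the two performance improvements the paper proposes. The names `Node`, `TernarySearchTree`, and `index_ngrams` are hypothetical, and treating each word token as one tree symbol is a choice made here for brevity.

```python
from typing import Optional, Sequence


class Node:
    """One tree node holding a single token of an n-gram (hypothetical layout)."""

    __slots__ = ("token", "lo", "eq", "hi", "count")

    def __init__(self, token: str) -> None:
        self.token = token
        self.lo: Optional["Node"] = None  # tokens that sort before self.token
        self.eq: Optional["Node"] = None  # next token of the same n-gram
        self.hi: Optional["Node"] = None  # tokens that sort after self.token
        self.count = 0                    # occurrences of the n-gram ending here


class TernarySearchTree:
    """Plain, unbalanced ternary search tree keyed by token sequences."""

    def __init__(self) -> None:
        self.root: Optional[Node] = None

    def add(self, ngram: Sequence[str]) -> None:
        """Insert one n-gram, or increment its count if already present."""
        if not ngram:
            return
        if self.root is None:
            self.root = Node(ngram[0])
        node, i = self.root, 0
        while True:
            token = ngram[i]
            if token < node.token:
                if node.lo is None:
                    node.lo = Node(token)
                node = node.lo
            elif token > node.token:
                if node.hi is None:
                    node.hi = Node(token)
                node = node.hi
            elif i == len(ngram) - 1:
                node.count += 1  # last token matched: n-gram ends here
                return
            else:
                i += 1  # token matched: descend to the next position
                if node.eq is None:
                    node.eq = Node(ngram[i])
                node = node.eq

    def count(self, ngram: Sequence[str]) -> int:
        """Return how many times the n-gram was added (0 if absent)."""
        node, i = self.root, 0
        while node is not None and ngram:
            token = ngram[i]
            if token < node.token:
                node = node.lo
            elif token > node.token:
                node = node.hi
            elif i == len(ngram) - 1:
                return node.count
            else:
                i += 1
                node = node.eq
        return 0


def index_ngrams(tokens: Sequence[str], n: int) -> TernarySearchTree:
    """Slide a window of n tokens over the text and index every n-gram."""
    tree = TernarySearchTree()
    for start in range(len(tokens) - n + 1):
        tree.add(tokens[start:start + n])
    return tree


tree = index_ngrams("a b a b c".split(), 2)
print(tree.count(["a", "b"]))  # bigram "a b" occurs twice in the sample text
```

Each node branches three ways on a single comparison, so shared prefixes are stored once while the per-node fan-out stays small; this sketch makes no attempt to keep the tree balanced.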

Václav Snásel | Jan Platos | Daniel Robenek

[1] Jae-Gil Lee, et al. n-Gram/2L: A Space and Time Efficient Two-Level n-Gram Inverted Index Structure, 2005, VLDB.

[2] David E. Siegel. All searches are divided into three parts: string searches using ternary trees, 1998, APL.

[3] Hugh E. Williams, et al. In-memory hash tables for accumulating text vocabularies, 2001, Inf. Process. Lett.

[4] Timothy C. Bell, et al. Selecting a hashing algorithm, 1990, Softw. Pract. Exp.

[5] Rada Mihalcea, et al. An Efficient Indexer for Large N-Gram Corpora, 2011, ACL.

[6] Douglas Comer, et al. Ubiquitous B-Tree, 1979, CSUR.

[7] Pavel Rychlý, et al. Detecting Co-Derivative Documents in Large Text Collections, 2008, LREC.

[8] Jiri Dvorský, et al. Index-based n-gram extraction from large document collections, 2011, 2011 Sixth International Conference on Digital Information Management.

[9] W. Bruce Croft, et al. Efficient indexing of repeated n-grams, 2011, WSDM '11.

[10] Leonard S. Cahen, et al. Educational Testing Service, 1970.

[11] David E. Siegel. All searches are divided into three parts: string searches using ternary trees, 1999.