Paper information - Ternary Tree Optimalization for n-gram Indexing

Ternary Tree Optimalization for n-gram Indexing

N-gram indexing is used in many practical applications, such as spam detection, plagiarism detection, and comparison of DNA reads. Many data structures can be used for this purpose, each with different characteristics. This article uses the ternary search tree. First, an improvement to the ternary tree is introduced that can save up to 43% of the required memory. Second, a new data structure, named the ternary forest, is proposed. The efficiency of the ternary forest is tested and compared with that of the ternary search tree and the two-level indexing ternary search tree.
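For readers unfamiliar with the baseline structure, the following is a minimal sketch of a plain ternary search tree used as an n-gram counter. It is not the memory optimization or the ternary forest proposed in the paper, whose details the abstract does not give. The names `Node`, `TernarySearchTree`, and `index_ngrams` are hypothetical, and keying each node on a whole word token (rather than a character) is an assumption made for illustration.

```python
from typing import Optional, Sequence


class Node:
    """One tree node: a token, three child links, and an n-gram counter."""
    __slots__ = ("token", "lo", "eq", "hi", "count")

    def __init__(self, token: str) -> None:
        self.token = token
        self.lo: Optional["Node"] = None  # tokens that sort before this one
        self.eq: Optional["Node"] = None  # next position of the n-gram
        self.hi: Optional["Node"] = None  # tokens that sort after this one
        self.count = 0                    # occurrences of n-grams ending here


class TernarySearchTree:
    """Plain ternary search tree (illustrative baseline, hypothetical names)."""

    def __init__(self) -> None:
        self.root: Optional[Node] = None

    def add(self, ngram: Sequence[str]) -> None:
        if ngram:
            self.root = self._add(self.root, ngram, 0)

    def _add(self, node: Optional[Node], ngram: Sequence[str], i: int) -> Node:
        token = ngram[i]
        if node is None:
            node = Node(token)
        if token < node.token:
            node.lo = self._add(node.lo, ngram, i)
        elif token > node.token:
            node.hi = self._add(node.hi, ngram, i)
        elif i + 1 < len(ngram):
            node.eq = self._add(node.eq, ngram, i + 1)
        else:
            node.count += 1  # last token of the n-gram matched
        return node

    def count(self, ngram: Sequence[str]) -> int:
        if not ngram:
            return 0
        node, i = self.root, 0
        while node is not None:
            token = ngram[i]
            if token < node.token:
                node = node.lo
            elif token > node.token:
                node = node.hi
            elif i + 1 < len(ngram):
                node, i = node.eq, i + 1
            else:
                return node.count
        return 0


def index_ngrams(tokens: Sequence[str], n: int) -> TernarySearchTree:
    """Insert every sliding window of n tokens into a fresh tree."""
    tree = TernarySearchTree()
    for start in range(len(tokens) - n + 1):
        tree.add(tokens[start:start + n])
    return tree


tokens = "to be or not to be".split()
tree = index_ngrams(tokens, 2)
print(tree.count(["to", "be"]))
```

Each node in this sketch stores three child links plus a counter regardless of whether they are used, which is the kind of per-node overhead a memory-oriented refinement would plausibly target; the paper's actual technique may differ.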

Václav Snásel | Jan Platos | Daniel Robenek
