Efficient In-memory Data Structures for n-grams Indexing

Indexing n-gram phrases from text has many practical applications, such as plagiarism detection, comparison of DNA sequences, and spam detection. In this paper we describe several data structures, such as the hash table and the B+ tree, that can store n-grams for searching. We perform tests that show their advantages and disadvantages. One data structure neglected for this purpose, the ternary search tree, is described in detail, and two performance improvements are proposed.