Extracting Corpus-Specific Strings by Using Suffix Arrays Enhanced with Longest Common Prefix

We propose a new term extraction algorithm that considers every substring of the corpus as a term candidate. The algorithm uses a suffix array, enhanced with longest common prefix information, as a data structure that emulates the suffix tree of the corpus. We use two scoring functions: one detects substring boundaries that correspond to good linguistic chunks, and the other finds domain-specific phrases. The two scores are combined with a re-ranking approach. Experiments show that the proposed all-substring term extraction algorithm performs well on high-frequency terms compared with a baseline algorithm that uses a morphological analyzer in its preprocessing step.
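To make the data structure concrete, the following is a minimal sketch of how a suffix array with a longest-common-prefix (LCP) array can stand in for a suffix tree when enumerating repeated substrings and their frequencies. It is an illustration only, not the algorithm or the scoring functions proposed here: the function names are hypothetical, the suffix array is built by naive sorting for brevity, the LCP array uses Kasai's algorithm, and the candidates reported are only those substrings that correspond to internal nodes of the suffix tree.

```python
def build_suffix_array(text):
    # Naive construction by sorting suffixes (assumption: small inputs only).
    return sorted(range(len(text)), key=lambda i: text[i:])


def build_lcp(text, sa):
    # Kasai's algorithm: lcp[i] is the length of the longest common prefix
    # of the suffixes at sa[i - 1] and sa[i]; lcp[0] is 0 by convention.
    n = len(text)
    rank = [0] * n
    for i, start in enumerate(sa):
        rank[start] = i
    lcp = [0] * n
    h = 0
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]
            while i + h < n and j + h < n and text[i + h] == text[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h > 0:
                h -= 1
        else:
            h = 0
    return lcp


def repeated_substrings(text):
    # Walk the LCP array with a stack to visit LCP intervals, which
    # correspond to the internal nodes of the suffix tree. Each interval
    # yields one repeated substring and its number of occurrences.
    sa = build_suffix_array(text)
    lcp = build_lcp(text, sa)
    n = len(text)
    freq = {}
    stack = [(0, 0)]  # (shared prefix length, left boundary in sa)
    for i in range(1, n + 1):
        cur = lcp[i] if i < n else 0
        left = i - 1
        while stack[-1][0] > cur:
            depth, lb = stack.pop()
            # Suffixes sa[lb .. i-1] all share a prefix of length depth.
            freq[text[sa[lb]:sa[lb] + depth]] = i - lb
            left = lb
        if stack[-1][0] < cur:
            stack.append((cur, left))
    return freq
```

For the toy input "banana", this sketch reports "a" with frequency 3 and "ana" and "na" with frequency 2 each. Substrings such as "an" also occur twice but lie in the middle of a suffix-tree edge, so they share the frequency of the node below them ("ana") and are not listed separately.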