Paper information - Extracting Corpus-Specific Strings by Using Suffix Arrays Enhanced with Longest Common Prefix


We propose a new term extraction algorithm that considers all substrings of the corpus as term candidates. The algorithm uses a suffix array as the data structure that emulates the suffix tree of the corpus. We use two scoring functions: one detects good substring boundaries as linguistic chunks, and the other finds domain-specific phrases. The two are combined with a re-ranking approach. Experiments show that the proposed all-substring term extraction algorithm performs well on high-frequency terms compared with a baseline algorithm that uses a morphological analyzer in its preprocessing step.
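To make the data-structure idea concrete, here is a minimal sketch of how a suffix array plus a longest-common-prefix (LCP) array can enumerate repeated substrings and their frequencies, which is the information the internal nodes of a suffix tree would provide. It is not the authors' implementation: the function names (`build_suffix_array`, `build_lcp`, `repeated_substrings`) are hypothetical, the suffix array is built by naive sorting for brevity, and the paper's two scoring functions and re-ranking step are not specified in the abstract, so they are left out.

```python
def build_suffix_array(text):
    # Naive construction: sort suffix start positions by the suffix text.
    # Assumption: acceptable for a small demo; real corpora need a faster builder.
    return sorted(range(len(text)), key=lambda i: text[i:])


def build_lcp(text, sa):
    # Kasai-style LCP construction.
    # lcp[r] is the common-prefix length of the suffixes at ranks r-1 and r.
    n = len(text)
    rank = [0] * n
    for r, i in enumerate(sa):
        rank[i] = r
    lcp = [0] * n
    h = 0
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]
            while i + h < n and j + h < n and text[i + h] == text[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h > 0:
                h -= 1
        else:
            h = 0
    return lcp


def repeated_substrings(text):
    # Walk the LCP array with a stack to visit each LCP interval, i.e. each
    # internal node of the emulated suffix tree. Each interval yields one
    # repeated substring together with its number of occurrences.
    sa = build_suffix_array(text)
    lcp = build_lcp(text, sa)
    n = len(text)
    counts = {}
    stack = []  # entries are (lcp_value, leftmost_rank_of_interval)
    for r in range(1, n + 1):
        cur = lcp[r] if r < n else 0  # sentinel 0 flushes the stack at the end
        left = r - 1
        while stack and stack[-1][0] > cur:
            h, left = stack.pop()
            counts[text[sa[left]:sa[left] + h]] = r - left
        if cur > 0 and (not stack or stack[-1][0] < cur):
            stack.append((cur, left))
    return counts


if __name__ == "__main__":
    print(repeated_substrings("banana"))
```

Only one substring per interval is reported (the longest one shared by that group of suffixes), so shorter prefixes with the same occurrence set, such as "an" alongside "ana" in "banana", are not listed separately. A term extractor in the spirit of the abstract would then score these candidates, but how it does so is outside what this sketch assumes.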
