Paper information - Unsupervised ensemble learning for Vietnamese multisyllabic word extraction


Automatic construction of a machine-readable dictionary is a basic and challenging problem in processing non-common languages. In this paper, we address the unsupervised ensemble learning (UEL) problem and investigate a UEL-based word extraction algorithm that detects multisyllabic words in large-scale Vietnamese text documents. First, we design a syllable-level n-gram gluer to generate a large number of candidate multisyllabic words. Second, we calculate two straightforward statistical features, word frequency and document frequency, and implement three unsupervised word extractors. The ensembler then merges the dictionaries produced by the extractors into the final one. Finally, we evaluate the effectiveness of the individual dictionaries and the ensemble dictionary through two dictionary-based Vietnamese word segmentation algorithms. The experimental results show that our UEL-based extraction algorithm is effective, and that the two word segmentation algorithms can achieve comparable results with the automatically extracted dictionaries.
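The pipeline outlined in the abstract (gluer, statistical features, extractors, ensembler) can be illustrated with a minimal Python sketch. The abstract does not specify how the three extractors score candidates or how the ensembler merges their dictionaries, so the threshold-based `extract` function, the majority-vote `ensemble` function, all threshold values, and the toy documents below are assumptions made purely for illustration, not the authors' method.

```python
from collections import Counter
from itertools import chain


def glue_ngrams(syllables, max_n=3):
    """Yield every run of 2..max_n adjacent syllables as a candidate word."""
    for n in range(2, max_n + 1):
        for i in range(len(syllables) - n + 1):
            yield " ".join(syllables[i:i + n])


def collect_statistics(documents, max_n=3):
    """Return (word_frequency, document_frequency) counters over candidates."""
    word_freq, doc_freq = Counter(), Counter()
    for doc in documents:
        candidates = list(glue_ngrams(doc.split(), max_n))
        word_freq.update(candidates)       # every occurrence counts
        doc_freq.update(set(candidates))   # at most one count per document
    return word_freq, doc_freq


def extract(word_freq, doc_freq, min_wf, min_df):
    """Hypothetical extractor: keep candidates that pass both thresholds."""
    return {
        w for w in word_freq
        if word_freq[w] >= min_wf and doc_freq[w] >= min_df
    }


def ensemble(dictionaries, min_votes):
    """Hypothetical ensembler: keep entries proposed by >= min_votes extractors."""
    votes = Counter(chain.from_iterable(dictionaries))
    return {w for w, v in votes.items() if v >= min_votes}


# Toy corpus (assumed); each string is one document of space-separated syllables.
documents = [
    "sinh viên học tiếng việt",
    "sinh viên đọc sách",
    "giáo viên dạy sinh viên",
]
wf, df = collect_statistics(documents, max_n=2)

# Three extractors that differ only in their (assumed) thresholds.
dictionaries = [
    extract(wf, df, min_wf=2, min_df=1),
    extract(wf, df, min_wf=1, min_df=2),
    extract(wf, df, min_wf=3, min_df=3),
]
final_dictionary = ensemble(dictionaries, min_votes=2)
print(sorted(final_dictionary))
```

In this toy corpus only "sinh viên" occurs in more than one document, so it is the only candidate that survives the vote. Voting is just one plausible merging rule; a union or intersection of the individual dictionaries would fit the same interface.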

Wuying Liu | Lin Wang
