Unsupervised ensemble learning for Vietnamese multisyllabic word extraction

Automatic construction of machine-readable dictionaries is a basic yet challenging problem in processing less common languages. In this paper, we address the unsupervised ensemble learning (UEL) problem and investigate a UEL-based word extraction algorithm that detects multisyllabic words in large-scale Vietnamese text documents. First, we design a syllable-level n-gram gluer to generate a large set of candidate multisyllabic words. Second, we compute two straightforward statistical features, word frequency and document frequency, and implement three unsupervised word extractors. Third, an ensembler merges the dictionaries produced by the individual extractors into a final dictionary. Finally, we evaluate the individual dictionaries and the ensemble dictionary using two dictionary-based Vietnamese word segmentation algorithms. The experimental results show that our UEL-based extraction algorithm is effective, and that the two word segmentation algorithms achieve comparable results when using the automatically extracted dictionaries.
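The pipeline outlined above (gluing syllables into candidates, counting word and document frequency, extracting, and ensembling) can be illustrated with a minimal Python sketch. It is not the method evaluated in this paper: the function names, the whitespace-based syllable splitting, the frequency thresholds, the three example extractors, and the majority-vote merging rule are all assumptions chosen for illustration.

```python
from collections import Counter


def glue_ngrams(syllables, max_n=3):
    """Yield every contiguous syllable n-gram with 2 <= n <= max_n."""
    for n in range(2, max_n + 1):
        for i in range(len(syllables) - n + 1):
            yield " ".join(syllables[i:i + n])


def count_features(documents, max_n=3):
    """Return (word_freq, doc_freq) counters over glued candidates.

    Assumption: syllables are separated by whitespace.
    """
    word_freq = Counter()  # total occurrences of each candidate
    doc_freq = Counter()   # number of documents containing each candidate
    for doc in documents:
        candidates = list(glue_ngrams(doc.split(), max_n))
        word_freq.update(candidates)
        doc_freq.update(set(candidates))
    return word_freq, doc_freq


def extract(freq, min_count):
    """Hypothetical extractor: keep candidates whose count reaches min_count."""
    return {w for w, c in freq.items() if c >= min_count}


def ensemble(dictionaries, min_votes):
    """Hypothetical ensembler: keep words found in at least min_votes dictionaries."""
    votes = Counter()
    for dictionary in dictionaries:
        votes.update(dictionary)
    return {w for w, v in votes.items() if v >= min_votes}


documents = ["học sinh đi học", "học sinh học bài", "sinh viên đi học"]
word_freq, doc_freq = count_features(documents)

# Three illustrative extractors (assumed, not the paper's):
by_wf = extract(word_freq, min_count=2)   # word-frequency threshold
by_df = extract(doc_freq, min_count=2)    # document-frequency threshold
by_both = by_wf & by_df                   # both thresholds at once

final_dictionary = ensemble([by_wf, by_df, by_both], min_votes=2)
print(sorted(final_dictionary))
```

Majority voting is only one possible merging rule; a union or intersection of the individual dictionaries would trade recall against precision differently.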