Paper information - Hierarchical Agglomerative Clustering of English-Bulgarian Parallel Corpora

Multilingual parallel corpora have become an essential resource for work in multilingual natural language processing. In this article, we report on our work using the hierarchical agglomerative clustering (HAC) technique to cluster multilingual parallel text drawn from web content. A clustering algorithm that takes constraints from parallel corpora potentially has several attractive features. Firstly, training samples in another language provide indirect evidence for a classification or clustering result. Secondly, constraints from both languages may help to eliminate some biased language-specific usages, resulting in classes of better quality. Finally, the alignment between pairs of clustered documents can be used to extract words from each language, which may then be used for other applications; as an example, in this paper we utilise these words for term reduction. We explain the findings that we obtain from clustering a significant parallel corpus for a pair of languages, one high-density and one low-density: English and Bulgarian. Preliminary results show that the HAC algorithm can effectively cluster bilingual parallel corpora separately and still produce the same extracted words that best describe these clusters for both the English and Bulgarian corpora.
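The idea of clustering each language side separately and then comparing the results can be illustrated with a minimal sketch. This is not the authors' implementation: the TF-IDF weighting, cosine similarity, average linkage, the function names, and the four-document toy corpus are all assumptions chosen for illustration, and the abstract does not specify any of them.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Turn whitespace-tokenised documents into sparse TF-IDF dicts (assumed weighting)."""
    tokenised = [doc.lower().split() for doc in docs]
    df = Counter(term for tokens in tokenised for term in set(tokens))
    n = len(docs)
    vectors = []
    for tokens in tokenised:
        tf = Counter(tokens)
        # Smoothed IDF, so a term found in every document keeps a small weight.
        vectors.append({t: c * (math.log((1 + n) / (1 + df[t])) + 1)
                        for t, c in tf.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0

def hac(vectors, k):
    """Average-linkage agglomerative clustering: merge the closest pair until k remain."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                sims = [cosine(vectors[i], vectors[j])
                        for i in clusters[a] for j in clusters[b]]
                score = sum(sims) / len(sims)
                if best is None or score > best[0]:
                    best = (score, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]

def top_terms(vectors, cluster, n=2):
    """Highest-weighted terms of a cluster, summed over its member documents."""
    total = Counter()
    for i in cluster:
        total.update(vectors[i])
    return [t for t, _ in total.most_common(n)]

# Hypothetical toy parallel corpus: document i in one list translates document i in the other.
english = ["cat dog pet", "dog cat animal", "bank market shares", "market bank money"]
bulgarian = ["котка куче любимец", "куче котка животно",
             "банка пазар акции", "пазар банка пари"]

en_vecs, bg_vecs = tfidf_vectors(english), tfidf_vectors(bulgarian)
en_clusters, bg_clusters = hac(en_vecs, k=2), hac(bg_vecs, k=2)

for en_c, bg_c in zip(en_clusters, bg_clusters):
    print(en_c, top_terms(en_vecs, en_c), "|", bg_c, top_terms(bg_vecs, bg_c))
```

On this toy data both sides group the same document indices together, so the top-weighted terms of each aligned cluster pair can be read as candidate translations of one another. A real corpus would also need tokenisation and stemming suited to each language, which the sketch omits.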

Rayner Alfred | Dimitar Kazakov | Mark Bartlett
