Hierarchical Agglomerative Clustering of English-Bulgarian Parallel Corpora

Multilingual parallel corpora have become an essential resource for work in multilingual natural language processing. In this article, we report on our work using the hierarchical agglomerative clustering (HAC) technique to cluster multilingual parallel text from web content. A clustering algorithm that takes constraints from parallel corpora potentially has several attractive features. Firstly, training samples in another language provide indirect evidence for a classification or clustering result. Secondly, constraints from both languages may help to eliminate some biased language-specific usages, resulting in classes of better quality. Finally, the alignment between pairs of clustered documents can be used to extract words from each language, which may then be used for other applications; as an example, in this paper we utilise these words for term reduction. We explain the findings that we obtained from clustering a significant parallel corpus for a pair of high-density and low-density languages, English and Bulgarian. Preliminary results show that the HAC algorithm can effectively cluster bilingual parallel corpora separately and still produce the same extracted words that best describe these clusters for both the English and Bulgarian corpora.
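
As a rough illustration of the idea of clustering each side of a parallel corpus separately and then comparing the results, the following minimal Python sketch runs a simple HAC on a toy aligned corpus. Everything in it is an assumption made for illustration: the four document pairs are invented, and the bag-of-words vectors, cosine distance, average linkage, and the helper names `hac` and `top_terms` are hypothetical choices, not necessarily the configuration used in this work.

```python
import math
from collections import Counter


def cosine_distance(a, b):
    """Cosine distance between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return 1.0 - dot / norm if norm else 1.0


def hac(docs, k):
    """Merge the two closest clusters until k remain (average linkage, assumed)."""
    vecs = [Counter(d.lower().split()) for d in docs]
    clusters = [[i] for i in range(len(docs))]  # start with one cluster per document
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Average pairwise distance between members of the two clusters.
                d = sum(
                    cosine_distance(vecs[a], vecs[b])
                    for a in clusters[i]
                    for b in clusters[j]
                ) / (len(clusters[i]) * len(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return sorted(sorted(c) for c in clusters)


def top_terms(docs, cluster, n):
    """Return the n most frequent terms among the documents of one cluster."""
    total = Counter()
    for i in cluster:
        total.update(docs[i].lower().split())
    return [term for term, _ in total.most_common(n)]


# Toy aligned corpus (invented): document i in one list translates document i in the other.
english = ["cat dog food", "dog cat pet", "bank money loan", "money bank credit"]
bulgarian = ["котка куче храна", "куче котка любимец", "банка пари заем", "пари банка кредит"]

en_clusters = hac(english, 2)
bg_clusters = hac(bulgarian, 2)

# Each language is clustered on its own; the alignment lets us compare the partitions.
same_partition = en_clusters == bg_clusters
for cluster in en_clusters:
    print(cluster, top_terms(english, cluster, 2), top_terms(bulgarian, cluster, 2))
```

Because the documents are aligned by index, two separately computed partitions can be compared directly, and the most frequent terms of matching clusters give a small candidate vocabulary in each language, which is the kind of output that could feed a term-reduction step.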