Termhood-Based Comparability Metrics of Comparable Corpus in Special Domain

Resources for cross-language information retrieval (CLIR) and machine translation (MT), such as dictionaries and parallel corpora, are scarce for special domains and exist for only a few languages, such as English, French, and Spanish. Automatically acquiring comparable corpora for such domains is one way to address this scarcity. Comparable corpora, whose subcorpora are not translations of each other, can be obtained easily from the web, so building and using them is often a more feasible option in multilingual information processing. Measuring comparability is one of the key issues in building and using comparable corpora. Currently there is no widely accepted definition of corpus comparability or method for measuring it; different definitions and metrics may suit different natural language processing tasks. This paper proposes a new comparability metric, the termhood-based metric, oriented to the task of bilingual terminology extraction. In this method, words are ranked by termhood rather than by frequency, and the cosine similarity calculated from the ranked lists of word termhood is used as the comparability score. Experimental results show that the termhood-based metric performs better than traditional frequency-based metrics.
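The idea of ranking words by termhood and comparing the ranked lists with cosine similarity can be illustrated with a minimal sketch. The abstract does not give the termhood formula or the cross-language mapping, so everything below is an assumption made for illustration: termhood is approximated as the difference between a word's normalized rank in the domain corpus and in a background corpus, source words are mapped to the target language through a hypothetical seed dictionary, and the names `rank_scores`, `termhood`, `cosine`, and `comparability` are invented here.

```python
import math


def rank_scores(counts):
    """Map each word to a normalized rank score in (0, 1]; the most frequent word gets 1.0."""
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    n = len(ordered)
    return {w: (n - i) / n for i, w in enumerate(ordered)}


def termhood(domain_counts, background_counts):
    """Assumed termhood: rank score in the domain corpus minus rank score in a background corpus."""
    dom = rank_scores(domain_counts)
    bg = rank_scores(background_counts)
    # Words unseen in the background corpus are treated as having rank score 0 there.
    return {w: dom[w] - bg.get(w, 0.0) for w in dom}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def comparability(src_termhood, tgt_termhood, dictionary, top_k=100):
    """Cosine between the top-k termhood lists of two corpora.

    `dictionary` is a hypothetical seed lexicon mapping a source word to a list
    of target-language translations; untranslatable source words are dropped.
    """
    def top(scores):
        best = sorted(scores, key=lambda w: (-scores[w], w))[:top_k]
        return {w: scores[w] for w in best}

    src, tgt = top(src_termhood), top(tgt_termhood)
    mapped = {}
    for word, score in src.items():
        for translation in dictionary.get(word, []):
            # Keep the highest score when several source words share a translation.
            mapped[translation] = max(mapped.get(translation, score), score)
    return cosine(mapped, tgt)
```

A frequency-based baseline can be obtained from the same sketch by passing `rank_scores(counts)` for each corpus to `comparability` in place of the termhood scores, which makes the two kinds of metric directly comparable on the same data.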
