TExSIS: Bilingual terminology extraction from parallel corpora using chunk-based alignment.

We report on TExSIS, a flexible bilingual terminology extraction system that uses a sophisticated chunk-based alignment method to generate candidate terms, after which the specificity of the candidate terms is determined by combining several statistical filters. Although the architecture is largely language-independent, we present terminology extraction results for four different languages and three language pairs. Gold standard data sets were created for French-Italian, French-English and French-Dutch, which allowed us to evaluate not only precision, which is common practice, but also recall. We compared the TExSIS approach, which takes a multilingual perspective from the start, with the more commonly used approach of first identifying term candidates monolingually and then aligning the source and target terms. A comparison of our system with the LUIZ approach described by Vintar (2010) reveals that TExSIS outperforms LUIZ for both monolingual and bilingual terminology extraction. Our results also clearly show that the precision of the alignment is crucial for the success of the terminology extraction. Furthermore, based on the observation that the precision scores for bilingual terminology extraction exceed those of the monolingual systems, we conclude that multilingual evidence helps to determine unithood in less related languages.
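As an illustration of the evaluation described above, the following minimal sketch computes precision and recall of extracted bilingual term pairs against a gold standard. It is not the TExSIS evaluation code: the function name `precision_recall`, the exact-match criterion, and the French-English term pairs are all assumptions made for the example.

```python
def precision_recall(extracted, gold):
    """Return (precision, recall) of extracted term pairs against a gold standard.

    Assumption: a pair counts as correct only on exact match of both the
    source and the target term; the paper may use a different criterion.
    """
    extracted, gold = set(extracted), set(gold)
    correct = extracted & gold
    precision = len(correct) / len(extracted) if extracted else 0.0
    recall = len(correct) / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical French-English term pairs, invented for illustration only.
gold = {
    ("boîte de vitesses", "gearbox"),
    ("frein à main", "handbrake"),
    ("pare-brise", "windscreen"),
    ("essuie-glace", "wiper"),
}
extracted = {
    ("boîte de vitesses", "gearbox"),
    ("frein à main", "handbrake"),
    ("voiture", "car"),  # a general-language pair, not in the gold standard
}

p, r = precision_recall(extracted, gold)
print(f"precision={p:.2f} recall={r:.2f}")
```

Recall can only be computed because the gold standard lists every term pair in the evaluated text; without it, only the correctness of the extracted pairs (precision) can be judged.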

[1]  Johann Gamper,et al.  Corpus-based terminology , 1998 .

[2]  M. Teresa Cabré Castellví,et al.  Theories of terminology. Their description, prescription and explanation , 2003 .

[3]  Horacio Rodríguez,et al.  Evaluation of terms and term extraction systems: A practical approach , 2007 .

[4]  Sophia Ananiadou,et al.  The C-value/NC-value domain-independent method for multi-word term extraction , 1999 .

[5]  Ziqi Zhang,et al.  A Comparative Evaluation of Term Recognition Algorithms , 2008, LREC.

[6]  Djoerd Hiemstra,et al.  Using statistical methods to create a bilingual dictionary , 1996 .

[7]  Walter Daelemans,et al.  A Chunk-Driven Bootstrapping Approach to Extracting Translation Patterns , 2010, CICLing.

[8]  Éric Gaussier Flow Network Models for Word Alignment and Terminology Extraction from Bilingual Corpora , 1998, COLING-ACL.

[9]  Philipp Koehn,et al.  Europarl: A Parallel Corpus for Statistical Machine Translation , 2005, MTSUMMIT.

[10]  Béatrice Daille,et al.  Study and Implementation of Combined Techniques for Automatic Extraction of Terminology , 1994 .

[11]  Paul Rayson,et al.  Comparing Corpora using Frequency Profiling , 2000, Proceedings of the workshop on Comparing corpora.

[12]  Lieve Macken,et al.  An Annotation Scheme and Gold Standard for Dutch-English Word Alignment , 2010, LREC.

[13]  Lieve Macken Sub-sentential alignment of translational correspondences , 2010 .

[14]  Heather Fulford Exploring terms and their linguistic environment in text: A domain-independent approach to automated term extraction , 2001 .

[15]  Tony McEnery,et al.  Corpus-Based Language Studies: An Advanced Resource Book , 2006 .

[16]  Walter Daelemans,et al.  An efficient memory-based morphosyntactic tagger and parser for Dutch , 2007, CLIN 2007.

[17]  Sue Ellen Wright 1.1 Term Selection: The Initial Phase of Terminology Management , 1997 .

[18]  Špela Vintar,et al.  Bilingual term recognition revisited: the bag-of-equivalents term alignment approach and its evaluation , 2010 .

[19]  Jörg Tiedemann Can bilingual word alignment improve monolingual phrasal term extraction , 2001 .

[20]  Béatrice Daille Morphological Rule Induction for Terminology Acquisition , 2000, COLING.

[21]  Kyo Kageura,et al.  Methods of Automatic Term Recognition: A Review , 1996 .

[22]  Hiroshi Nakagawa,et al.  A Simple but Powerful Automatic Term Extraction Method , 2002, COLING 2002.

[23]  Hinrich Schütze,et al.  Book Reviews: Foundations of Statistical Natural Language Processing , 1999, CL.

[24]  Robert L. Mercer,et al.  The Mathematics of Statistical Machine Translation: Parameter Estimation , 1993, CL.

[25]  Slava M. Katz,et al.  Technical terminology: some linguistic properties and an algorithm for identification in text , 1995, Natural Language Engineering.

[26]  Hermann Ney,et al.  A Systematic Comparison of Various Statistical Alignment Models , 2003, CL.

[27]  M. Teresa Cabré Castellví,et al.  Automatic term detection: A review of current systems , 2001 .

[28]  Julian Kupiec,et al.  An Algorithm for Finding Noun Phrase Correspondences in Bilingual Corpora , 1993, ACL.

[29]  François Yvon,et al.  The Contribution of Low Frequencies to Multilingual Sub-sentential Alignment: a Differential Associative Approach , 2011 .