In no uncertain terms: a dataset for monolingual and multilingual automatic term extraction from comparable corpora

Automatic term extraction is a productive field of research within natural language processing, but it still faces significant obstacles regarding datasets and evaluation, both of which require manual term annotation. This is an arduous task, made even more difficult by the lack of a clear distinction between terms and general language, which results in low inter-annotator agreement. There is a great need for well-documented, manually validated datasets, especially in the emerging field of multilingual term extraction from comparable corpora, which presents its own set of challenges. In this paper, a new approach is presented for both monolingual and multilingual term annotation in comparable corpora. The detailed guidelines with different term labels, the domain- and language-independent methodology, and the large volumes annotated in three languages and four domains make this a rich resource. The resulting datasets are suited not only for evaluation purposes but can also serve as a general source of information about terms and even as training data for supervised methods. Moreover, the gold standard for multilingual term extraction from comparable corpora contains information about term variants and translation equivalents, which allows an in-depth, nuanced evaluation.
