The ACL RD-TEC: A Dataset for Benchmarking Terminology Extraction and Classification in Computational Linguistics

This paper introduces ACL RD-TEC, a dataset for evaluating the extraction and classification of terms from literature in the domain of computational linguistics. The dataset is derived from the Association for Computational Linguistics Anthology Reference Corpus (ACL ARC). In its first release, the ACL RD-TEC consists of automatically segmented, part-of-speech-tagged ACL ARC documents, three lists of candidate terms, and more than 82,000 manually annotated terms. Each annotated term is marked as either valid or invalid, and valid terms are further classified as either technology or non-technology terms. Technology terms signify methods, algorithms, and solutions in computational linguistics. The paper describes the dataset and reports the relevant statistics. We hope that this first step encourages a collaborative effort towards building a full-fledged annotated corpus from the computational linguistics literature.
