Supervised learning for robust term extraction

We propose a machine learning method that automatically classifies n-grams extracted from a corpus as terms or non-terms. As training features, we use 10 statistics that are common in the term extraction literature. The method is applicable to term recognition across multiple domains and languages, and it can 1) avoid laborious post-processing (e.g., subjective threshold setting) and 2) handle skewed training data while showing noticeable resilience to domain shift. Experiments are carried out on six corpora covering multiple domains and languages: the GENIA and ACL RD-TEC 1.0 corpora serve as training sets, and four TTC subcorpora on wind energy and mobile technology, in both Chinese and English, serve as test sets. The results are promising and indicate that the approach can identify both single-word and multiword terms with reasonably good precision and recall.
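
To make the setup concrete, the following is a minimal sketch of the two preparatory steps the abstract describes: turning candidate n-grams into statistical feature vectors and compensating for label skew. It is not the authors' implementation. The three statistics (frequency, document frequency, length), the toy documents, the gold term set, and the function names are all assumptions chosen for illustration; the abstract does not list the 10 statistics actually used, nor how skewness is handled.

```python
from collections import Counter


def ngrams(tokens, max_n=3):
    """Yield every contiguous n-gram (as a tuple) for n = 1..max_n."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])


def candidate_features(docs, max_n=3):
    """Map each n-gram to [frequency, document frequency, length in tokens].

    These three statistics are illustrative stand-ins, not the paper's features.
    """
    tf, df = Counter(), Counter()
    for doc in docs:
        grams = list(ngrams(doc.lower().split(), max_n))
        tf.update(grams)       # total occurrences across the corpus
        df.update(set(grams))  # number of documents containing the n-gram
    return {g: [tf[g], df[g], len(g)] for g in tf}


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count).

    One common way to offset skew; assumed here, not taken from the paper.
    """
    counts = Counter(labels)
    return {c: len(labels) / (len(counts) * k) for c, k in counts.items()}


# Toy corpus and hypothetical gold annotations (assumptions for illustration).
docs = [
    "wind turbine blades convert wind energy",
    "the wind turbine has three blades",
]
gold_terms = {("wind", "turbine"), ("wind", "energy"), ("blades",)}

features = candidate_features(docs)
labels = [1 if g in gold_terms else 0 for g in features]
weights = balanced_class_weights(labels)

print(features[("wind", "turbine")])
print(weights)
```

Even in this toy corpus most candidate n-grams are non-terms, so the term class receives the larger weight. The feature vectors and weights could then be passed to any standard classifier.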
