Evaluating the Reliability and Interaction of Recursively Used Feature Classes for Terminology Extraction

Feature design and selection are crucial when terminology extraction is treated as a machine learning classification problem. We design feature classes that characterize different properties of terms based on distributions, and we propose a new feature class for the components of term candidates. Using random forests, we infer optimal features, which are then used to build decision tree classifiers. We evaluate our method on the ACL RD-TEC dataset. We demonstrate the importance of the novel feature class, which exploits properties of term components, for downgrading termhood. Furthermore, our classification results suggest that reliable term candidates should be identified successively, rather than in a single pass.
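The two-stage idea in the abstract (rank features with a random forest, then build a decision tree on the best ones) can be sketched roughly as follows. This is only an illustration, assuming scikit-learn and NumPy. The data is synthetic, and the feature count, the labelling rule, the cut-off `k`, and all hyperparameters are placeholders that are not taken from the paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: rows are term candidates, columns are features.
# None of these features come from the paper; they are placeholders.
rng = np.random.default_rng(0)
n_candidates, n_features = 400, 8
X = rng.normal(size=(n_candidates, n_features))
# Hypothetical labels: only features 0 and 3 carry signal.
y = (X[:, 0] + X[:, 3] > 0).astype(int)

# Step 1: rank features by random-forest importance.
forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X, y)
ranking = np.argsort(forest.feature_importances_)[::-1]

# Step 2: keep the top-k features (k is an arbitrary choice here).
k = 2
selected = np.sort(ranking[:k])

# Step 3: train a single decision tree on the selected features only.
tree = DecisionTreeClassifier(max_depth=4, random_state=0)
tree.fit(X[:, selected], y)
train_accuracy = tree.score(X[:, selected], y)
```

In this toy setup the forest should rank the two informative columns highest, and the resulting tree stays small enough to inspect by hand, which is one plausible reason for pairing a forest-based selection step with a single interpretable tree.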
