A Comparative Evaluation of Term Recognition Algorithms

Automatic Term Recognition (ATR) is a fundamental processing step preceding more complex tasks such as semantic search and ontology learning. Of the many methods available in the literature, only a few can handle both single-word and multi-word terms. In this paper we compare five such algorithms and propose a combined approach based on a voting mechanism. We evaluated the six approaches on two corpora and show that the voting algorithm performs best on one corpus (a collection of texts from Wikipedia) but less well on the Genia corpus (a standard life science corpus). This indicates that the choice and design of the corpus have a major impact on the evaluation of term recognition algorithms. Our experiments also showed that single-word terms can be as important as multi-word terms and account for a fairly large proportion of terms in certain domains. As a result, algorithms that ignore single-word terms may cause problems for tasks built on top of ATR. Effective ATR systems also need to take into account both the unstructured text and the structured aspects, which means that information extraction techniques need to be integrated into the term recognition process.
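The abstract does not say how the voting mechanism combines the individual algorithms, so the following is only an illustrative sketch of one common way to merge ranked term lists, a Borda-style count. It is not necessarily the scheme used in the paper. The function name `vote` and the example candidate lists are hypothetical.

```python
from collections import defaultdict


def vote(rankings):
    """Merge several ranked candidate-term lists into one ranking.

    Assumption: Borda-style voting. Each algorithm gives a term
    (n - position) / n points, where n is the length of that algorithm's
    list, so the top term gets 1.0 and unlisted terms get nothing.
    """
    points = defaultdict(float)
    for ranked in rankings:
        n = len(ranked)
        for position, term in enumerate(ranked):
            points[term] += (n - position) / n
    # Highest total first; ties are broken alphabetically for determinism.
    return sorted(points, key=lambda term: (-points[term], term))


# Hypothetical output of three term recognition algorithms, best term first.
rankings = [
    ["t cell", "protein", "the"],
    ["protein", "t cell", "gene"],
    ["protein", "gene", "t cell"],
]
combined = vote(rankings)
print(combined)
```

Normalising by list length keeps an algorithm that returns many candidates from dominating the vote. A weighted variant could multiply each algorithm's points by a trust factor.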
