Term acquisition: a text-probing approach

To assist terminologists in compiling terminology collections in specialist domains, a "text probing" approach to the acquisition of English terms from special language texts is specified, designed, implemented, and evaluated. The approach draws on aspects of general language corpus linguistics and computational lexicography, and follows current trends towards corpus-based terminology compilation. It is founded specifically on observations about the linguistic features of English terms and their collocational behaviour in special language texts, and represents an effort to extend the scope of existing collocation studies from general language to special language. It aims to be independent of both domain and text type. By operating on the premise that a term is likely to reside in a special language text between boundary markers consisting of closed class words and punctuation, it permits the acquisition of single- and multi-word terms spanning a range of word classes. The approach has been implemented in a prototype computer program, "Termspotter", written in Quintus Prolog. The program processes untagged special language texts, either individually or in batches. It functions by "probing" texts for closed class words and punctuation, and extracting as term candidates the items that reside between them. A systematic evaluation of the text-probing approach is presented in which, using an innovative experimental design, the term acquisition efficiency of Termspotter is measured against the manual scanning output of domain experts and compared with the scanning output of terminologists. Results for the special language texts studied so far indicate that, on average, Termspotter accurately retrieves 80% of the terms identified by a domain expert, and typically retrieves the remaining 20% partially. The program performed very favourably in comparison with human terminologists.
Extensions of our text-probing approach to other languages are anticipated. Moreover, wider applications of the notion of text probing are envisaged, both within and beyond the terminology community, for abstracting other structures from special language texts.
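
The boundary-marker premise described above can be illustrated with a minimal sketch. It is written in Python for brevity and is not the Termspotter program, which was written in Quintus Prolog. The closed class word list, the tokenisation pattern, and the function name `probe` are all assumptions made for illustration; the abstract does not specify them.

```python
import re

# Illustrative, deliberately small closed class word list (an assumption:
# the abstract does not give the list that Termspotter uses).
CLOSED_CLASS = frozenset({
    "a", "an", "the", "of", "in", "on", "at", "by", "for", "to", "from",
    "with", "and", "or", "but", "is", "are", "was", "were", "be", "been",
    "it", "its", "this", "that", "these", "those", "which", "as", "can",
    "may", "has", "have",
})

# A token is either a word (internal hyphens and apostrophes allowed)
# or a single punctuation character. This pattern is also an assumption.
TOKEN = re.compile(r"(?P<word>\w+(?:[-']\w+)*)|(?P<punct>[^\w\s])")


def probe(text):
    """Return the word sequences that lie between boundary markers.

    A boundary marker is a closed class word or a punctuation character.
    The text needs no part-of-speech tags.
    """
    candidates = []
    current = []  # words collected since the last boundary marker
    for match in TOKEN.finditer(text):
        token = match.group()
        is_open_word = (
            match.lastgroup == "word" and token.lower() not in CLOSED_CLASS
        )
        if is_open_word:
            current.append(token)
        elif current:
            # Boundary marker reached: emit the collected words as a candidate.
            candidates.append(" ".join(current))
            current = []
    if current:  # flush a candidate that runs to the end of the text
        candidates.append(" ".join(current))
    return candidates


if __name__ == "__main__":
    sentence = ("The viscosity of the hydraulic fluid is measured "
                "by a rotational viscometer.")
    print(probe(sentence))
    # ['viscosity', 'hydraulic fluid', 'measured', 'rotational viscometer']
```

The output contains both single- and multi-word candidates, and it also contains items such as "measured" that an expert would probably not accept as terms. The sketch yields term candidates only; how such candidates are assessed against expert judgement is the subject of the evaluation summarised above.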
