Technical terminology: some linguistic properties and an algorithm for identification in text

This paper identifies some linguistic properties of technical terminology and uses them to formulate an algorithm for identifying technical terms in running text. The grammatical properties discussed are preferred phrase structures: technical terms consist mostly of noun phrases containing adjectives, nouns, and occasionally prepositions; rarely do terms contain verbs, adverbs, or conjunctions. The discourse properties are patterns of repetition that distinguish noun phrases that are technical terms from other types of noun phrase; this holds especially for the multi-word phrases that constitute a substantial majority of all technical vocabulary. The paper presents a terminology identification algorithm that is motivated by these linguistic properties. An implementation of the algorithm is described; it recovers a high proportion of the technical terms in a text, and a high proportion of the recovered strings are valid technical terms. The algorithm proves to be effective regardless of the domain of the text to which it is applied.
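The two properties described above, a restricted phrase structure and repetition, can be combined into a simple filter. The following is a minimal sketch only, not the paper's implementation: it assumes the input is already part-of-speech tagged with a hypothetical coarse tagset (`A` adjective, `N` noun, `P` preposition, anything else treated as "other"), and the exact pattern, the `min_freq` threshold, and the `max_len` limit are illustrative assumptions rather than values taken from the abstract.

```python
import re
from collections import Counter

# Assumed structural filter over a string of one-letter tags:
# a multi-word sequence of adjectives/nouns, optionally containing one
# preposition that directly follows a noun, and always ending in a noun.
TERM_PATTERN = re.compile(r"(?:[AN]+|[AN]*NP[AN]*)N")


def candidate_terms(tagged_tokens, min_freq=2, max_len=5):
    """Return {phrase: count} for repeated phrases matching TERM_PATTERN.

    tagged_tokens: list of (word, tag) pairs using the hypothetical
    tagset A / N / P; every other tag is mapped to 'O' (other).
    """
    tags = "".join(tag if tag in {"A", "N", "P"} else "O"
                   for _, tag in tagged_tokens)
    words = [word.lower() for word, _ in tagged_tokens]

    counts = Counter()
    for start in range(len(tags)):
        # Only multi-word spans (length >= 2) are considered.
        last_end = min(start + max_len, len(tags))
        for end in range(start + 2, last_end + 1):
            if TERM_PATTERN.fullmatch(tags[start:end]):
                counts[" ".join(words[start:end])] += 1

    # Repetition filter: keep phrases seen at least min_freq times.
    return {phrase: n for phrase, n in counts.items() if n >= min_freq}


if __name__ == "__main__":
    sample = [
        ("the", "O"), ("central", "A"), ("processing", "N"), ("unit", "N"),
        ("executes", "O"), ("instructions", "N"), ("and", "O"),
        ("the", "O"), ("central", "A"), ("processing", "N"), ("unit", "N"),
        ("reads", "O"), ("memory", "N"),
    ]
    for phrase, n in sorted(candidate_terms(sample).items()):
        print(n, phrase)
```

Note that this sketch counts every matching span, so sub-phrases nested inside a longer candidate (for example "processing unit" inside "central processing unit") are reported as well; whether and how to merge them is left open here.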
