Word Association Norms, Mutual Information, and Lexicography

The term word association is used in a very particular sense in the psycholinguistic literature. (Generally speaking, subjects respond more quickly than normal to the word "nurse" if it follows a highly associated word such as "doctor.") We will extend the term to provide the basis for a statistical description of a variety of interesting linguistic phenomena, ranging from semantic relations of the doctor/nurse type (content word/content word) to lexico-syntactic co-occurrence constraints between verbs and prepositions (content word/function word). This paper proposes an objective measure, based on the information-theoretic notion of mutual information, for estimating word association norms from computer-readable corpora. (The standard method of obtaining word association norms, testing a few thousand subjects on a few hundred words, is both costly and unreliable.) The proposed measure, the association ratio, estimates word association norms directly from computer-readable corpora, making it possible to estimate norms for tens of thousands of words.
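As a rough illustration of the kind of measure described above, the following sketch estimates a pointwise mutual information score, log2(P(x, y) / (P(x) P(y))), for ordered word pairs in a tokenized corpus. The function name `association_ratios`, the `window` parameter, and the toy corpus are assumptions made for this example and do not come from the abstract. The probabilities are plain relative frequencies, which is a simplification; a real estimate would need a large corpus and a minimum-count threshold, since scores for rare pairs are unstable.

```python
import math
from collections import Counter


def association_ratios(tokens, window=5):
    """Score ordered pairs (x, y) where y occurs among the
    (window - 1) tokens that follow x.

    Returns a dict mapping (x, y) to log2(P(x, y) / (P(x) * P(y))),
    with all probabilities estimated as counts divided by corpus size.
    This normalization is an illustrative assumption.
    """
    n = len(tokens)
    word_counts = Counter(tokens)
    pair_counts = Counter()

    # Count each ordered co-occurrence of x followed by y in the window.
    for i, x in enumerate(tokens):
        for y in tokens[i + 1 : i + window]:
            pair_counts[(x, y)] += 1

    ratios = {}
    for (x, y), f_xy in pair_counts.items():
        p_xy = f_xy / n
        p_x = word_counts[x] / n
        p_y = word_counts[y] / n
        ratios[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return ratios


# Toy corpus (hypothetical data, far too small for meaningful norms).
corpus = ["doctor", "nurse", "tree", "doctor", "nurse", "rock"]
scores = association_ratios(corpus, window=2)
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 3))
```

Because pairs are counted in order, the score for (x, y) can differ from the score for (y, x), so the sketch does not assume the measure is symmetric. A positive score means the pair co-occurs more often than chance would predict under independence; a score near zero means there is no evidence of association.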
