Collocation and term extraction using linguistically enhanced statistical methods

The research presented in this thesis substantiates, defines and evaluates, in a language- and domain-independent manner, two new linguistically motivated statistical association measures: limited syntagmatic modifiability (LSM) for collocation extraction and limited paradigmatic modifiability (LPM) for term extraction. The task they are designed for, computing lexical association scores to determine the degree of collocativity and termhood of collocation and term candidates, is the crucial backbone of any approach to collocation and term extraction. In this respect, LSM and LPM resemble the wide variety of standard frequency-based, statistical and information-theoretic association measures put forth in the computational linguistics research literature. What distinguishes LSM and LPM is that their defining parameters are based on actual linguistic properties of the targeted linguistic constructions, viz. collocations and terms. The central linguistic property that is isolated in the linguistic research literature and shared by collocations and terms is limited modifiability. This property is parameterized so as to account for the obvious linguistic differences between collocations and terms: collocations typically occur in general language and surface in a variety of syntactic constructions, while terms are typically confined to noun phrases in domain-specific sublanguage. Limited modifiability is embedded within an appropriate linguistic frame of reference, the lexical-collocational layer of Firth's (1957) contextualist model of language description. With the help of this model, the linguistic differences are realized as limited syntagmatic modifiability in the case of collocations and as limited paradigmatic modifiability in the case of terms.
The respective linguistically enhanced lexical association measures exploit these properties as observable and quantifiable parameters of their statistical computations: LSM incorporates the tendency of collocations to limit the number of potential syntagmatic attachments, whereas LPM incorporates the tendency of terms to limit the number of potential paradigmatic substitutions. Frequency of co-occurrence is another prominent linguistic property incorporated into both measures, and it is the only linguistic property also exploited by the standard frequency-based, statistical and information-theoretic association measures for collocation and term extraction. In order to compare LSM and LPM against their standard competitors, a comprehensive performance evaluation setting is established: for collocation extraction on German-language preposition-noun-verb collocation candidates, and for term extraction on English-language noun phrase term candidates from a biomedical subdomain. In this setting, a wide array of standard quantitative performance metrics is applied, complemented by a new qualitative performance evaluation metric that compares the output ranking of an association measure against the challenging baseline of frequency of co-occurrence. All experimental results show that LSM and LPM outperform the other frequency-based, statistical and information-theoretic lexical association measures by large margins in every aspect of performance evaluation considered. Thus, lexical association measures that base their statistical computations on linguistic parameters instead of standard statistical ones exhibit not only conceptual but also empirical superiority.
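
The idea of limited paradigmatic modifiability can be illustrated with a small sketch. It is not the definition of LPM used in the thesis: the function names, the toy trigram counts and the scoring rule (frequency divided by the product of per-slot substitution counts) are assumptions made purely for illustration. The sketch counts, for each slot of a candidate n-gram, how many distinct tokens occur in that slot while the remaining slots are held fixed; a candidate that admits few substitutions and occurs frequently receives a higher score.

```python
from collections import Counter, defaultdict


def substitution_counts(ngram_counts):
    """For each n-gram, count the distinct fillers observed in each slot
    when all other slots are held fixed (hypothetical helper)."""
    fillers = defaultdict(set)
    for ngram in ngram_counts:
        for i, token in enumerate(ngram):
            # None marks the open slot; the rest of the n-gram is the frame.
            frame = ngram[:i] + (None,) + ngram[i + 1:]
            fillers[frame].add(token)
    return {
        ngram: [
            len(fillers[ngram[:i] + (None,) + ngram[i + 1:]])
            for i in range(len(ngram))
        ]
        for ngram in ngram_counts
    }


def toy_modifiability_score(ngram_counts):
    """Illustrative score only: co-occurrence frequency penalized by the
    number of substitutions seen in every slot (an assumed simplification)."""
    subs = substitution_counts(ngram_counts)
    scores = {}
    for ngram, freq in ngram_counts.items():
        penalty = 1
        for n in subs[ngram]:
            penalty *= n
        scores[ngram] = freq / penalty
    return scores


# Toy counts, invented for this example.
counts = Counter({
    ("open", "reading", "frame"): 6,
    ("large", "protein", "complex"): 6,
    ("small", "protein", "complex"): 3,
    ("large", "protein", "family"): 3,
})

subs = substitution_counts(counts)
scores = toy_modifiability_score(counts)
for ngram, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(" ".join(ngram), subs[ngram], score)
```

In this toy data, "open reading frame" and "large protein complex" have the same frequency, but only the former resists substitution in every slot, so it ranks first. A frequency-only baseline could not separate the two candidates.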
