Defining collocation for Slovenian lexical resources

In this paper, we define the notion of collocation for use in machine-readable language resources that will serve in the creation of electronic dictionaries and language applications for Slovene. Drawing on theoretical and lexicographically driven studies, we treat collocation as a lexical phenomenon with three key aspects: statistical, syntactic, and semantic. We take lexicographic relevance as the point of departure for placing collocations within the typology of word combinations and for distinguishing them from free combinations. Free combinations are (frequent) syntactically valid word combinations that have no lexicographic value, so neither their meaning nor their syntactic role needs to be described. Next, we distinguish collocations from multiword lexical units (compounds, phraseological units and lexico-grammatical units) using the lexicographic view that multiword lexical units, whose meaning is not the sum of their parts, require a description of their meaning, whereas collocations do not. In the final part, we return to the three aspects of collocation and their role in the automatic extraction of collocational information from corpora. Applying the semantic criterion, that is, the dictionary relevance of extracted collocations, has exposed in particular the problem of semantically broad collocates, such as certain types of adverbs, adjectives and verbs, and of words that occur in different syntactic roles (e.g. pronouns and adjuncts). We also discuss the specific issue of collocations involving proper names and the decisions about their inclusion in the dictionary, which are based on lexicographers' evaluation.
