Populating sub-entries in dictionaries with multi-word units from concordance lines

Abstract: Lexicography is primarily concerned with the representation of words and their senses in dictionaries. By words most dictionary users and lexicographers refer to a combination of characters delineated by spaces on both sides. This article discusses the weakness of this approach in the selection of dictionary entries. Through an inspection of concordance lines generated from a multi-million Setswana corpus, it is argued and demonstrated how multi-word units (MWUs), also known as multi-word expressions (MWEs), may be extracted from concordance lines to supplement dictionary entries. It is illustrated how both monolingual and bilingual Setswana dictionaries may be enhanced by the addition of MWEs as sub-entries.

Keywords: SETSWANA, LEXICOGRAPHY, MULTI-WORD UNIT, CORPUS, CONCORDANCE, MULTI-WORD EXPRESSION, COLLOCATION, WORD, SUB-ENTRIES, DICTIONARY

Opsomming: Die aanvulling van subinskrywings in woordeboeke met meerwoordige eenhede uit konkordansiereels. Leksikografie is hoofsaaklik gemoeid met die weergawe van woorde en hul betekenisse in woordeboeke. Met woorde verwys die meeste woordeboekgebruikers en leksikograwe na 'n kombinasie van lettertekens afgegrens deur spasies aan beide kante. Hierdie artikel bespreek die swakheid van hierdie benadering by die keuse van woordeboekinskrywings. Deur 'n ondersoek van konkordansiereels gegenereer uit 'n multimiljoen-Setswanakorpus, word daar geredeneer en verduidelik hoe meerwoordige eenhede (MWE's), ook bekend as meerwoordige uitdrukkings (MWU's), uit konkordansiereels onttrek kan word om woordeboekinskrywings aan te vul. Daar word aangetoon hoe sowel eentalige as meertalige Setswanawoordeboeke uitgebrei kan word deur die toevoeging van MWU's as subinskrywings.

Sleutelwoorde: SETSWANA, LEKSIKOGRAFIE, MEERWOORDIGE EENHEID, KORPUS, KONKORDANSIE, MEERWOORDIGE UITDRUKKING, KOLLOKASIE, WOORD, SUBINSKRYWINGS, WOORDEBOEK
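
As a rough illustration of the idea in the abstract, the sketch below builds keyword-in-context (concordance) lines for a node word and counts the word sequences that recur immediately to its right. It is only a minimal sketch under stated assumptions: the article describes inspecting concordance lines, not this procedure, and the function names (`concordance`, `candidate_mwus`), the window width, the frequency threshold, and the toy English text are all hypothetical stand-ins rather than material from the Setswana corpus.

```python
from collections import Counter


def concordance(tokens, node, width=3):
    """Return KWIC lines (left context, node, right context) for each hit of node."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            lines.append((left, tok, right))
    return lines


def candidate_mwus(lines, max_len=3, min_freq=2):
    """Count contiguous sequences that start at the node word and recur."""
    counts = Counter()
    for _left, node, right in lines:
        # Extend the node with 1 .. max_len - 1 following words.
        for n in range(1, min(max_len - 1, len(right)) + 1):
            counts[(node, *right[:n])] += 1
    # Keep only sequences seen at least min_freq times.
    return {" ".join(seq): c for seq, c in counts.items() if c >= min_freq}


# Toy placeholder text (hypothetical; not drawn from the Setswana corpus).
toy_text = (
    "she will look up the word then look up the phrase "
    "and look at the page before they look up the entry"
)
tokens = toy_text.split()

kwic = concordance(tokens, "look")
candidates = candidate_mwus(kwic)

for left, node, right in kwic:
    print(f"{' '.join(left):>20} [{node}] {' '.join(right)}")
print(candidates)
```

Recurrent sequences found this way are only candidates. Whether one of them merits a sub-entry under its headword remains a lexicographic judgement made by reading the concordance lines themselves.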
