Ukwabelana - An open-source morphological Zulu corpus

Zulu is an indigenous language of South Africa and one of the country's eleven official languages. It is spoken by about 11 million people. Although its speaker population is similar in size to that of some Western languages, e.g. Swedish, Zulu is considerably under-resourced. This paper presents a new open-source morphological corpus for Zulu, the Ukwabelana corpus. We describe the agglutinating morphology of Zulu, with its multiple prefixation and suffixation, and introduce our labeling scheme. We then describe the annotation process and explain each individual resource. The resources comprise a list of 10,000 labeled and 100,000 unlabeled word types, 3,000 part-of-speech (POS) tagged and 30,000 raw sentences, a morphological Zulu grammar, and a parsing algorithm that hypothesizes possible word roots and enumerates the parses that conform to the Zulu grammar. We also provide a POS tagger that assigns a grammatical category to a morphologically analyzed word type. In the hope that the corpus and all resources will benefit anyone doing research on Zulu or on computer-aided analysis of languages, they will be made available in the public domain from http://www.cs.bris.ac.uk/Research/MachineLearning/Morphology/Resources/.
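The idea of hypothesizing roots and enumerating grammar-conforming parses can be illustrated with a minimal sketch. It is not the corpus's parsing algorithm or grammar: the affix inventories, the four-slot verb template (subject concord, optional tense marker, root, final vowel) and the function name `enumerate_parses` are all assumptions chosen for illustration. Any non-empty string left between the recognized prefixes and the final vowel is treated as a candidate root.

```python
# Toy affix inventories (illustrative assumptions, not the Ukwabelana grammar).
SUBJECT_CONCORDS = ("ngi", "u", "si", "ni", "ba")
TENSE_MARKERS = ("", "ya")  # "" means the tense slot is left empty
FINAL_VOWELS = ("a",)


def enumerate_parses(word):
    """Return every segmentation of `word` that fits the toy template
    subject concord + optional tense marker + root + final vowel."""
    parses = []
    for sc in SUBJECT_CONCORDS:
        if not word.startswith(sc):
            continue
        after_sc = word[len(sc):]
        for tense in TENSE_MARKERS:
            if not after_sc.startswith(tense):
                continue
            rest = after_sc[len(tense):]
            for fv in FINAL_VOWELS:
                # The hypothesized root is whatever remains between the
                # prefixes and the final vowel; it must be non-empty.
                if rest.endswith(fv) and len(rest) > len(fv):
                    root = rest[:-len(fv)]
                    morphemes = [sc] + ([tense] if tense else []) + [root, fv]
                    parses.append(morphemes)
    return parses


if __name__ == "__main__":
    for parse in enumerate_parses("siyahamba"):
        print("-".join(parse))
```

For `siyahamba` ("we are walking") the sketch prints two candidates, `si-yahamb-a` and `si-ya-hamb-a`. Because the root is only hypothesized, the template alone cannot choose between them; deciding which parse is correct requires additional knowledge, such as a list of known roots or labeled data.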
