Strategies for building wordnets for under-resourced languages: The case of African languages

The African Wordnet Project (AWN) aims to build wordnets for five African languages: Setswana, isiXhosa, isiZulu, Sesotho sa Leboa (also referred to as Sepedi or Northern Sotho) and Tshivenda. Currently, the African Wordnets are developed manually using the so-called expand model, which is based on the structure of the English Princeton WordNet (PWN). This is labour-intensive work that must be performed by linguistic experts, guided by considerations such as the level of lexicalisation of a term in the African language. Until now, linguists have been responsible for identifying and translating appropriate synsets without much help from electronic resources, because for African languages even basic resources such as computer-readable, electronic bilingual wordlists are usually not freely available. Methods to speed up the manual development of synsets and ease the workload of the human language experts were recently investigated. These centred on using the minimal information available in bilingual dictionaries to identify synsets in the PWN that should be included in the AWN, transferring information from the dictionaries to the wordnet, and presenting the potential synsets to linguists for final approval and inclusion in the wordnets. In this article, we describe the methodology developed for building the African Wordnets, a potentially significant resource for natural language processing applications. We investigate the available resources that could be exploited and those that had to be developed, and we explain initial results and future plans.
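The dictionary-based step can be illustrated with a minimal sketch. It is not the project's actual tooling: the wordlist, the synset identifiers, the glosses and the function name `propose_candidates` are all hypothetical stand-ins chosen for illustration. The sketch looks up each English lemma of a PWN-style synset in a small English-isiZulu wordlist and collects the translations as candidate synsets, which would still need a linguist's approval.

```python
from collections import defaultdict

# Hypothetical toy wordlist (English -> isiZulu); a real bilingual
# dictionary would be far larger and noisier.
BILINGUAL = {"dog": ["inja"], "cat": ["ikati"]}

# Hypothetical stand-in for PWN: synset id -> (English lemmas, gloss).
# Identifiers and glosses are illustrative, not copied from PWN.
PWN = {
    "dog.n.01": (["dog", "domestic dog"], "a domesticated canine"),
    "cat.n.01": (["cat", "true cat"], "a small domesticated feline"),
    "frank.n.02": (["frank", "hot dog", "dog"], "a smooth-textured sausage"),
}


def propose_candidates(bilingual, pwn):
    """Map each synset id to the target-language words its lemmas translate to."""
    candidates = defaultdict(set)
    for synset_id, (lemmas, _gloss) in pwn.items():
        for lemma in lemmas:
            for translation in bilingual.get(lemma, []):
                candidates[synset_id].add(translation)
    # Sort for stable output; synsets without any translation are omitted.
    return {sid: sorted(words) for sid, words in candidates.items()}


candidates = propose_candidates(BILINGUAL, PWN)
for sid, words in sorted(candidates.items()):
    print(sid, words, "-> pending linguist review")
```

In this toy data the English lemma "dog" occurs in both the animal synset and the sausage synset, so "inja" is proposed for both. Ambiguity of this kind is one reason the candidates are presented to linguists for final approval rather than being added automatically.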
