Is it possible to create a very large wordnet in 100 days? An evaluation

Wordnets are large-scale lexical databases of related words and concepts, useful for language-aware software applications. They have recently been built for many languages using various approaches. The Finnish wordnet, FinnWordNet (FiWN), was created by translating the more than 200,000 word senses in the English Princeton WordNet (PWN) 3.0 in 100 days. To ensure quality, the word senses were translated by professional translators. The direct translation approach was based on the assumption that most synsets in PWN represent language-independent real-world concepts. The semantic relations between synsets were therefore also assumed to be mostly language-independent, so the structure of PWN could be reused as well. This approach allowed the creation of an extensive Finnish wordnet directly aligned with PWN. It also provided us with a translation relation and thus a bilingual wordnet usable as a dictionary. In this paper, we address several concerns raised with regard to our approach, many of which are addressed here for the first time. We evaluate the craftsmanship of the translators by checking the spelling and translation quality, the viability of the approach by assessing the synonym quality on both the lexeme and the concept level, and the usefulness of the resulting lexical resource both for humans and in a language-technological task. We discovered no new problems compared with those already known in PWN. As a whole, the paper contributes to the scientific discourse on what it takes to create a very large wordnet. As a side effect of the evaluation, we extended FiWN to contain 208,645 word senses in 120,449 synsets, effectively making version 2.0 of FiWN currently the largest wordnet in the world by these statistics.
