BasiLex: An 11.5-million-word corpus of Dutch texts written for children

This article discusses BasiLex, a recently finalized corpus of written Dutch offered to children of elementary school age, comprising 13.5 million tokens and 11.5 million words. The corpus has been automatically annotated with part-of-speech tags and lemmas, and a limited number of polysemous words have been disambiguated, in part automatically. A lemma-based lexicon has also been derived from the corpus. The aim of the present article is threefold. First, to describe BasiLex and how it was built, and to discuss its validity. Second, to compare the most frequent words in the BasiLex lexicon with those in two other lexicons: the Schrooten and Vermeer (1994) lexicon, derived from a small and now outdated Dutch corpus of language addressed to children, and a lexicon derived from SoNaR, a corpus of written language for adults (Oostdijk et al. 2013). Third, to discuss some potential educational applications of BasiLex.

[1]  Onderzoek woordfrequentie : Resultaten kranten , 1962 .

[2]  Werkgroep Frequentie-onderzoek van het Nederlands,et al.  Woordfrequenties in geschreven en gesproken Nederlands , 1975 .

[3]  E. Finegan Language : Its Structure and Use , 1989 .

[4]  A. Bryk,et al.  Early vocabulary growth: Relation to language input and gender. , 1991 .

[5]  A. Rudell Frequency of word usage and perceived word difficulty: Ratings of Kučera and Francis words , 1993 .

[6]  R. P. Carver Percentage of Unknown Vocabulary Words in Text as a Function of the Relative Difficulty of the text: Implications for Instruction , 1994 .

[7]  P. Nation,et al.  Vocabulary size and use: Lexical richness in L2 written production , 1995 .

[8]  R. H. Baayen,et al.  The CELEX Lexical Database (CD-ROM) , 1996 .

[9]  R. Appel,et al.  Nederlands als tweede taal in het basisonderwijs , 1996 .

[10]  G. Hitch,et al.  Separate effects of word frequency and age of acquisition in recognition and recall. , 1998 .

[11]  S. Gerhand,et al.  Word frequency effects in oral reading are not merely age-of-acquisition effects in disguise. , 1998 .

[12]  P. Nation,et al.  A vocabulary-size test of controlled productive ability , 1999 .

[13]  Jeanine Treffers-Daller,et al.  De meting van woordenschatrijkdom in het Turks van Turks-Duits tweetaligen , 1999 .

[14]  Paul Rayson,et al.  Comparing Corpora using Frequency Profiling , 2000, Proceedings of the Workshop on Comparing Corpora.

[15]  P. Nation,et al.  Unknown vocabulary density and reading comprehension , 2000 .

[16]  A. Vermeer Breadth and depth of vocabulary in relation to L1/L2 acquisition and frequency of input , 2001, Applied Psycholinguistics.

[17]  Alison Wray,et al.  Formulaic Language and the Lexicon , 2002 .

[18]  A. Tellings,et al.  Mode of acquisition of word meanings: The viability of a theoretical construct , 2003, Applied Psycholinguistics.

[19]  A. Vermeer,et al.  Een Passende Woordkeus: Het kiezen van Woorden voor Woordenschatlessen , 2003 .

[20]  R. Hout,et al.  Lexical richness in the spontaneous speech of bilinguals , 2003 .

[21]  A. Vermeer The relation between lexical richness and vocabulary size in Dutch L1 and L2 children , 2004 .

[22]  Hugo Van hamme,et al.  JASMIN-CGN: Extension of the Spoken Dutch Corpus with Speech of Elderly People, Children and Non-natives in the Human-Machine Interaction Modality , 2006, LREC.

[23]  A. Vermeer,et al.  Literacy achievement of children with intellectual disabilities and differing linguistic backgrounds. , 2006, Journal of intellectual disability research : JIDR.

[24]  Walter Daelemans,et al.  An efficient memory-based morphosyntactic tagger and parser for Dutch , 2007, CLIN 2007.

[25]  Colin Bannard,et al.  Stored Word Sequences in Language Learning , 2008, Psychological science.

[26]  J. Elman On the Meaning of Words and Dinosaur Bones: Lexical Knowledge Without a Lexicon , 2009, Cogn. Sci..

[27]  N. J. Goossens,et al.  Wat Is Een Optimale Tekstdekking? Woordkennis En Tekstbegrip In Groep , 2009 .

[28]  N. Snider,et al.  More than words: Frequency effects for multi-word phrases , 2010 .

[29]  Joanne Lee Size matters: Early vocabulary as a predictor of language and literacy competence , 2011 .

[30]  Antal van den Bosch,et al.  DutchSemCo: building a semantically annotated corpus for Dutch , 2011 .

[31]  Ludo Verhoeven,et al.  Vocabulary Growth and Reading Development across the Elementary School Years , 2011 .

[32]  R. Baayen,et al.  Effects of morphological Family Size for young readers. , 2012, The British journal of developmental psychology.

[33]  Morten H. Christiansen,et al.  How hierarchical is language use? , 2012, Proceedings of the Royal Society B: Biological Sciences.

[34]  Katja Hofmann,et al.  Cornetto: A Combinatorial Lexical Semantic Database for Dutch , 2013, Essential Speech and Language Technology for Dutch.

[35]  Nelleke Oostdijk,et al.  The Construction of a 500-Million-Word Reference Corpus of Contemporary Written Dutch , 2013, Essential Speech and Language Technology for Dutch.

[36]  Martin Reynaert,et al.  FoLiA: A practical XML Format for Linguistic Annotation - a descriptive and comparative study , 2014, CLIN 2014.