Reference List of Slovene Frequent Common Words

This reference list of the most frequent common Slovene words was compiled by taking the intersection of the 10,000 most frequent lemmas in each of four Slovene text corpora: Kres, the balanced reference corpus of written Slovene; GOS, the reference corpus of spoken Slovene; Janes, the corpus of computer-mediated communication; and Solar 2.0, the corpus of school written production. The list was then manually cleaned and contains 4,768 common general lemmas. The file is in tab-separated format. Each row contains the lemma, its part of speech (following the MULTEXT-East tagset for Slovene), its relative average reduced frequency in each of the four corpora, and a final average score computed from these values. The dataset is described in more detail in: Spela Arhar Holdt, Senja Pollak, Marko Robnik Sikonja, Simon Krek (2020). Referencni seznam pogostih splosnih besed za slovenscino. In: Proceedings of the Conference on Language Technologies and Digital Humanities, pp. 10-15.
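
As a rough illustration of the selection step and the file layout described above, the following Python sketch intersects per-corpus top-lemma lists and reads rows of the form lemma, part of speech, four frequencies, final score. It is only a sketch under stated assumptions: the names `CORPORA`, `Entry`, `common_lemmas` and `read_entries` are hypothetical, the order of the four frequency columns is assumed (the description names the corpora but not the column order), and the file may or may not have a header row, so rows whose numeric fields do not parse are simply skipped. The sample values in the usage lines are invented for demonstration and are not taken from the dataset.

```python
import csv
import io
from typing import Dict, Iterable, List, NamedTuple, Set

# ASSUMPTION: order of the four frequency columns; the description does not state it.
CORPORA = ("Kres", "GOS", "Janes", "Solar")


class Entry(NamedTuple):
    lemma: str
    pos: str               # MULTEXT-East part-of-speech tag
    arf: Dict[str, float]  # relative average reduced frequency per corpus
    score: float           # final average score, as stored in the file


def common_lemmas(top_lists: Iterable[Iterable[str]]) -> Set[str]:
    """Return the lemmas that occur in every per-corpus top-N list."""
    sets = [set(lemmas) for lemmas in top_lists]
    return set.intersection(*sets) if sets else set()


def read_entries(handle) -> List[Entry]:
    """Parse tab-separated rows: lemma, POS, four frequencies, final score."""
    expected_columns = 2 + len(CORPORA) + 1
    entries = []
    for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) != expected_columns:
            continue  # blank or malformed line
        try:
            values = [float(v) for v in row[2:]]
        except ValueError:
            continue  # non-numeric fields, e.g. a header row if one exists
        per_corpus = dict(zip(CORPORA, values[:-1]))
        entries.append(Entry(row[0], row[1], per_corpus, values[-1]))
    return entries


if __name__ == "__main__":
    # Invented sample data, for demonstration only.
    sample = (
        "lemma\tpos\tkres\tgos\tjanes\tsolar\tscore\n"
        "biti\tV\t10.0\t20.0\t30.0\t40.0\t25.0\n"
    )
    for entry in read_entries(io.StringIO(sample)):
        print(entry.lemma, entry.pos, entry.score)
    print(common_lemmas([["biti", "in", "hiša"], ["in", "biti"], ["biti", "pa"]]))
```

To read the actual file, one would pass an open file handle instead of the `io.StringIO` object, for example `open(path, encoding="utf-8", newline="")`, where `path` is wherever the downloaded file is stored and UTF-8 encoding is an assumption. The sketch reads the final score from the file rather than recomputing it, since the exact averaging procedure is given in the cited paper and not in this description.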