Learning Indonesian Frequently Used Vocabulary from Large-Scale News

Frequently used vocabulary plays an important role in language learning. In this work, we describe how we derive a list of frequently used Indonesian vocabulary, together with a vocabulary level scheme. We apply natural language processing and statistical techniques to a large-scale corpus of Indonesian news. According to our scheme, a learner of Indonesian may need to know about 11,200 words to achieve adequate comprehension when reading an Indonesian article. We also compare vocabulary distribution and usage between English and Indonesian. Our work could be helpful to learners of Indonesian and to related research areas such as language teaching.
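To make the idea of a vocabulary size target concrete, the following is a minimal sketch of how one might estimate the number of distinct words needed to reach a given token coverage of a corpus. It is not the procedure used in this work: the function name `words_needed_for_coverage`, the regex tokenizer, the toy sentence, and the coverage targets are all assumptions for illustration, and a real pipeline for Indonesian would likely stem or lemmatize words before counting.

```python
from collections import Counter
import re


def words_needed_for_coverage(text, target=0.95):
    """Return how many of the most frequent word types are needed so that
    they account for at least `target` of all tokens in `text`.

    Hypothetical helper: lowercases the text and treats each run of
    ASCII letters as one token (no stemming or lemmatization).
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    total = sum(counts.values())
    covered = 0
    # Walk the word types from most to least frequent, accumulating coverage.
    for rank, (_word, freq) in enumerate(counts.most_common(), start=1):
        covered += freq
        if covered / total >= target:
            return rank
    # Only reached for empty input, where there are no word types at all.
    return len(counts)


# Toy example (assumed data, not from the paper's corpus).
sample = "saya makan nasi dan saya minum air dan saya tidur"
print(words_needed_for_coverage(sample, target=0.5))
print(words_needed_for_coverage(sample, target=1.0))
```

On a real news corpus, the same cumulative count, run at whatever coverage threshold one considers adequate for comprehension, would yield a vocabulary size estimate comparable in kind to the figure reported above.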
