IndoNLU: Benchmark and Resources for Evaluating Indonesian Natural Language Understanding

Although Indonesian is known to be the fourth most frequently used language on the internet, research progress in natural language processing (NLP) for this language has been slow due to a lack of available resources. In response, we introduce the first-ever vast resource for training, evaluating, and benchmarking Indonesian natural language understanding (IndoNLU) tasks. IndoNLU includes twelve tasks, ranging from single-sentence classification to sentence-pair sequence labeling, with varying levels of complexity. The datasets for these tasks span different domains and styles to ensure task diversity. We also provide a set of Indonesian pre-trained models (IndoBERT) trained on a large and clean Indonesian dataset (Indo4B) collected from publicly available sources such as social media texts, blogs, news, and websites. We release baseline models for all twelve tasks, as well as the framework for benchmark evaluation, enabling everyone to benchmark their system performance.