Liputan6: A Large-scale Indonesian Dataset for Text Summarization

In this paper, we introduce a large-scale Indonesian summarization dataset. We harvest articles from Liputan6.com, an online news portal, and obtain 215,827 document-summary pairs. We leverage pre-trained language models to develop benchmark extractive and abstractive summarization methods over the dataset with multilingual and monolingual BERT-based models. We include a thorough error analysis by examining machine-generated summaries that have low ROUGE scores, and expose issues both with ROUGE itself and with extractive and abstractive summarization models.

Timothy Baldwin | Fajri Koto | Jey Han Lau
