Paper information - Building a Japanese Typo Dataset from Wikipedia's Revision History

User-generated texts contain many typos, which must be corrected for NLP systems to work. Although a large number of typo–correction pairs are needed to develop a data-driven typo correction system, no such dataset is available for Japanese. In this paper, we extract over half a million Japanese typo–correction pairs from Wikipedia’s revision history. Compared with other languages, Japanese poses unique challenges: (1) Japanese texts are unsegmented, so we cannot simply apply a spelling checker, and (2) the way people input kanji logographs results in typos whose surface forms differ drastically from the correct ones. We address these challenges by combining character-based extraction rules, morphological analyzers to guess readings, and various filtering methods. We evaluate the dataset using crowdsourcing and run a baseline seq2seq model for typo correction.

Yugo Murawaki | Daisuke Kawahara | Sadao Kurohashi | Yu Tanaka
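
The character-based extraction mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that sentence pairs from two consecutive revisions are already aligned, uses Python's standard `difflib` as a stand-in for the paper's extraction rules, and leaves out the reading-based checks with morphological analyzers and the filtering steps entirely. The function name `extract_char_edits` and the `max_len` threshold are hypothetical.

```python
from difflib import SequenceMatcher


def extract_char_edits(before: str, after: str, max_len: int = 3):
    """Return (operation, typo, correction) candidates for one sentence pair.

    `before` is the sentence in the older revision and `after` the same
    sentence in the newer one. Working on characters avoids the need for
    word segmentation, which Japanese text does not provide.
    """
    # autojunk=False disables difflib's heuristic for very frequent
    # elements, so the diff stays a plain character-level comparison.
    matcher = SequenceMatcher(None, before, after, autojunk=False)
    candidates = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            continue
        # Hypothetical filter: long edits are more likely content
        # changes than typo fixes, so they are skipped.
        if (i2 - i1) > max_len or (j2 - j1) > max_len:
            continue
        candidates.append((op, before[i1:i2], after[j1:j2]))
    return candidates


# A kanji conversion error: 意外 ("unexpected") typed for 以外 ("other than").
print(extract_char_edits("それ意外の方法はない", "それ以外の方法はない"))
# → [('replace', '意', '以')]
```

The two words in the example share the reading "igai", which is why a character diff alone cannot tell a conversion error from a deliberate rewording; that is the gap the reading-guessing step described in the abstract is meant to close.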