Building a Japanese Typo Dataset from Wikipedia's Revision History

User-generated texts contain many typos, and correcting them is necessary for NLP systems to work well. Although a large number of typo–correction pairs are needed to develop a data-driven typo correction system, no such dataset is available for Japanese. In this paper, we extract over half a million Japanese typo–correction pairs from Wikipedia’s revision history. Unlike other languages, Japanese poses unique challenges: (1) Japanese text is unsegmented, so we cannot simply apply a spelling checker, and (2) the way people input kanji logographs produces typos whose surface forms differ drastically from the correct ones. We address these challenges by combining character-based extraction rules, morphological analyzers that guess readings, and various filtering methods. We evaluate the dataset using crowdsourcing and run a baseline seq2seq model for typo correction.
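The core idea of character-based extraction can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline: it aligns an old and a revised sentence at the character level and keeps only revisions that look like small, local edits, standing in for the paper's extraction rules and filtering; the edit-count and span-length thresholds are hypothetical placeholders.

```python
import difflib

def extract_typo_pairs(old_sent, new_sent, max_edits=2, max_span=4):
    """Extract character-level edits between an old and a revised sentence.

    Returns a list of (typo_span, correction_span) tuples, or an empty
    list when the revision looks like a content change rather than a
    typo fix (too many edits, or edited spans that are too long).
    """
    matcher = difflib.SequenceMatcher(a=old_sent, b=new_sent, autojunk=False)
    # Collect every non-matching region of the character alignment.
    edits = [(old_sent[i1:i2], new_sent[j1:j2])
             for tag, i1, i2, j1, j2 in matcher.get_opcodes()
             if tag != "equal"]
    # Heuristic filter: a genuine typo fix changes only a few short spans.
    if not edits or len(edits) > max_edits:
        return []
    if any(len(a) > max_span or len(b) > max_span for a, b in edits):
        return []
    return edits

# A one-character kana confusion is kept as a candidate pair:
extract_typo_pairs("今日わ晴れです", "今日は晴れです")  # → [("わ", "は")]
# A wholesale rewrite is rejected by the span-length filter:
extract_typo_pairs("短い文", "まったく別の長い文章になった")  # → []
```

Because Japanese is unsegmented, this alignment operates directly on characters rather than tokens; a real pipeline would additionally use a morphological analyzer to compare readings, so that kanji-level typos with very different surface forms can still be linked to their corrections.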
