Synthetic Data for English Lexical Normalization: How Close Can We Get to Manually Annotated Data?

Social media is a valuable data resource for various natural language processing (NLP) tasks. However, standard NLP tools were often designed with standard texts in mind, and their performance decreases heavily when applied to social media data. One solution to this problem is to adapt the input text to a more standard form, a task also referred to as normalization. Automatic approaches to normalization have been shown to improve performance on a variety of NLP tasks. However, all of these systems are supervised and therefore depend heavily on the availability of training data for the correct language and domain. In this work, we attempt to overcome this dependence by automatically generating training data for lexical normalization. Starting with raw tweets, we explore two directions: inserting non-standardness (noise) and automatically normalizing in an unsupervised setting. We evaluate both approaches by training an existing lexical normalization system on the generated data. Our best results are achieved by automatically inserting noise, specifically with a custom error generation system that makes use of some manually created datasets. With this system, we score 94.29 accuracy on the test data, compared to 95.22 when the normalization system is trained on human-annotated data. Our best system that does not depend on any type of annotation is based on word embeddings and scores 92.04 accuracy. Finally, we perform an experiment in which we ask humans to predict whether a sentence was written by a human or generated by our best model. This experiment shows that in most cases it is hard for a human to detect automatically generated sentences.
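As a rough illustration of the noise-insertion direction, the sketch below turns clean tokens into aligned (noisy, clean) training pairs. It is not the error generation system evaluated in this work: the replacement table `REPLACEMENTS`, the helper `drop_vowel`, and the `noise_rate` parameter are all hypothetical names and values chosen for the example.

```python
import random

# Hypothetical clean-to-noisy replacement table; entries are illustrative only.
REPLACEMENTS = {"you": ["u"], "tomorrow": ["tmrw", "2morrow"], "because": ["cuz"]}


def drop_vowel(word, rng):
    """Delete one non-initial vowel, a simple character-level noise operation."""
    positions = [i for i, ch in enumerate(word) if i > 0 and ch in "aeiou"]
    if not positions:
        return word  # nothing to delete, leave the word unchanged
    i = rng.choice(positions)
    return word[:i] + word[i + 1:]


def insert_noise(tokens, noise_rate=0.3, seed=0):
    """Return a list of (noisy, clean) pairs, one per input token."""
    rng = random.Random(seed)
    pairs = []
    for tok in tokens:
        noisy = tok
        if rng.random() < noise_rate:
            if tok in REPLACEMENTS:
                # Word-level noise: swap in a known non-standard variant.
                noisy = rng.choice(REPLACEMENTS[tok])
            else:
                # Character-level fallback for words outside the table.
                noisy = drop_vowel(tok, rng)
        pairs.append((noisy, tok))
    return pairs


if __name__ == "__main__":
    for noisy, clean in insert_noise("see you tomorrow because".split(), noise_rate=1.0):
        print(noisy, "->", clean)
```

Because every noisy token stays aligned with its clean source, pairs of this shape could in principle be fed to a supervised word-level normalization system in place of human-annotated pairs.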
