DCU-ADAPT: Learning Edit Operations for Microblog Normalisation with the Generalised Perceptron

We describe the work carried out by the DCU-ADAPT team on the Lexical Normalisation shared task at W-NUT 2015. We train a generalised perceptron to annotate noisy text with edit operations that normalise the text when executed. Features are character n-grams, recurrent neural network language model hidden layer activations, character class, and eligibility for editing according to the task rules. We combine predictions from 25 models trained on subsets of the training data by selecting the most likely normalisation according to a character language model. We compare the generalised perceptron with conditional random fields, which are restricted to smaller amounts of training data due to memory constraints. Furthermore, we make a first attempt to ver
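To illustrate what "edit operations that normalise the text when executed" could look like, here is a minimal sketch in Python. It assumes one edit label per input character and a hypothetical label inventory (`KEEP`, `DEL`, `INS`, `SUB`); the actual operation set, the perceptron tagger, and the features used by the system are not specified in the abstract and are not reproduced here.

```python
from typing import List, Tuple

# Hypothetical edit-operation labels (an assumption for illustration),
# one label per input character:
#   ("KEEP", "")  leave the character unchanged
#   ("DEL", "")   delete the character
#   ("INS", s)    keep the character and insert string s after it
#   ("SUB", s)    replace the character with string s
EditOp = Tuple[str, str]


def apply_edits(noisy: str, ops: List[EditOp]) -> str:
    """Execute a per-character edit script and return the normalised text."""
    if len(noisy) != len(ops):
        raise ValueError("need exactly one edit operation per character")
    out = []
    for ch, (name, arg) in zip(noisy, ops):
        if name == "KEEP":
            out.append(ch)
        elif name == "DEL":
            continue  # drop the character
        elif name == "INS":
            out.append(ch + arg)
        elif name == "SUB":
            out.append(arg)
        else:
            raise ValueError(f"unknown edit operation: {name}")
    return "".join(out)


# In a full system a sequence labeller would predict this script;
# here it is written by hand.
script = [("SUB", "to"), ("KEEP", ""), ("KEEP", ""), ("INS", "r"), ("INS", "w")]
normalised = apply_edits("2moro", script)
```

Framing normalisation as labelling each character with an edit operation turns the task into sequence labelling, which is what allows a tagger such as a generalised perceptron or a conditional random field to be applied to it.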
