Vector representation of non-standard spellings using dynamic time warping and a denoising autoencoder

The prevalence of non-standard spellings on Twitter poses challenges for many natural language processing tasks. Traditional approaches mainly treat the problem as one of translation, spell checking, or speech recognition. This paper proposes a method that represents the stochastic relationship between words and their non-standard versions as real-valued vectors. The method uses dynamic time warping to preprocess the non-standard spellings and a denoising autoencoder to derive the vector representation. The derived vectors encode word patterns, and the Euclidean distance between them defines a distance in the word space that challenges the prevailing edit distance. After training the autoencoder on 1051 different words and their non-standard versions, the results show that the new distance retrieves the correct standard word among the five closest words in 89.53% of cases, compared to only 68.22% using the edit distance.
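The abstract outlines the pipeline without implementation detail, so the following is a minimal sketch of one plausible reading, not the authors' implementation: dynamic time warping stretches or compresses each spelling onto a fixed number of character slots, a single-hidden-layer denoising autoencoder learns to reconstruct the warped standard spelling from its warped non-standard version (treating the natural misspelling as the corruption), and candidate words are then ranked by Euclidean distance between hidden codes, with Levenshtein edit distance as the baseline. The alphabet, the frame length, the positional DTW cost, the architecture, and the names `warp_to_frames`, `FRAME_LEN`, `LEXICON`, and the query "thnx" are all illustrative assumptions.

```python
# A minimal sketch, assuming: (i) lowercase a-z alphabet, (ii) a fixed
# 12-slot frame per word, (iii) DTW on relative character positions,
# (iv) a single-hidden-layer autoencoder trained with plain SGD. None of
# these choices are specified in the abstract; all are illustrative.
import numpy as np

ALPHABET = "abcdefghijklmnopqrstuvwxyz"
CHAR_IDX = {c: i for i, c in enumerate(ALPHABET)}
FRAME_LEN = 12                       # character slots per warped word
N_IN = FRAME_LEN * len(ALPHABET)     # autoencoder input dimension

def warp_to_frames(word):
    """DTW-align a spelling onto FRAME_LEN slots, then one-hot encode it."""
    n = len(word)
    # Local cost: squared difference of relative positions, so short words
    # are stretched and long words compressed, monotonically.
    cost = (np.linspace(0, 1, n)[:, None]
            - np.linspace(0, 1, FRAME_LEN)[None, :]) ** 2
    D = np.full((n + 1, FRAME_LEN + 1), np.inf)  # accumulated DTW cost
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, FRAME_LEN + 1):
            D[i, j] = cost[i - 1, j - 1] + min(
                D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    # Backtrack the optimal warping path; every slot receives a character.
    slots, i, j = [""] * FRAME_LEN, n, FRAME_LEN
    while True:
        slots[j - 1] = word[i - 1]
        if i == 1 and j == 1:
            break
        step = int(np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    vec = np.zeros((FRAME_LEN, len(ALPHABET)))
    for k, c in enumerate(slots):
        vec[k, CHAR_IDX[c]] = 1.0
    return vec.ravel()

class DenoisingAutoencoder:
    """Sigmoid hidden and output layers, MSE loss, per-example SGD."""
    def __init__(self, n_in, n_hidden=64, lr=0.5, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 0.1, (n_in, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(0.0, 0.1, (n_hidden, n_in))
        self.b2 = np.zeros(n_in)
        self.lr = lr

    def encode(self, x):                       # hidden code = word vector
        return 1.0 / (1.0 + np.exp(-(x @ self.W1 + self.b1)))

    def train_step(self, x_noisy, x_clean):
        h = self.encode(x_noisy)
        y = 1.0 / (1.0 + np.exp(-(h @ self.W2 + self.b2)))
        dz2 = (y - x_clean) * y * (1.0 - y)    # MSE grad through sigmoid
        dz1 = (dz2 @ self.W2.T) * h * (1.0 - h)
        self.W2 -= self.lr * np.outer(h, dz2)
        self.b2 -= self.lr * dz2
        self.W1 -= self.lr * np.outer(x_noisy, dz1)
        self.b1 -= self.lr * dz1

def edit_distance(a, b):
    """Levenshtein distance, the baseline the paper compares against."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (ca != cb))
    return dp[-1]

# Toy stand-in for the paper's 1051-word training set.
LEXICON = {
    "tomorrow": ["tmrw", "tomoz", "tomorow"],
    "thanks":   ["thx", "thanx", "tanks"],
    "because":  ["becuz", "bcoz", "cuz"],
    "people":   ["ppl", "peeps"],
}

dae = DenoisingAutoencoder(N_IN)
for _ in range(1000):                          # input: non-standard spelling,
    for std, variants in LEXICON.items():      # target: its standard form
        for var in variants:
            dae.train_step(warp_to_frames(var), warp_to_frames(std))

query = "thnx"                                 # unseen variant of "thanks"
code = dae.encode(warp_to_frames(query))
by_vector = sorted(LEXICON, key=lambda w:
                   np.linalg.norm(code - dae.encode(warp_to_frames(w))))
by_edit = sorted(LEXICON, key=lambda w: edit_distance(query, w))
print("Euclidean ranking in code space:", by_vector)
print("Edit-distance ranking:          ", by_edit)
```

The fixed-length warping is what lets variable-length spellings share one Euclidean space; the paper's top-5 accuracy figure corresponds to a ranking like the one produced in the final lines, computed over its full 1051-word vocabulary rather than this toy lexicon.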
