Stacked DeBERT: All Attention in Incomplete Data for Text Classification

In this paper, we propose Stacked DeBERT, short for Stacked Denoising Bidirectional Encoder Representations from Transformers. This novel model improves robustness on incomplete data, compared to existing systems, through a novel encoding scheme built on BERT, a powerful language representation model based solely on attention mechanisms. Incomplete data in natural language processing refers to text with missing or incorrect words, and its presence can hinder the performance of current models, which were not designed to withstand such noise yet must still perform well under it. This is because current approaches are built for and trained on clean, complete data, and thus cannot extract features that adequately represent incomplete data. Our proposed approach obtains intermediate input representations by applying an embedding layer to the input tokens followed by vanilla transformers. These intermediate features are fed to novel denoising transformers, which are responsible for obtaining richer input representations. The proposed approach takes advantage of stacks of multilayer perceptrons to reconstruct the embeddings of missing words by extracting more abstract and meaningful hidden feature vectors, and of bidirectional transformers for improved embedding representations. We consider two datasets for training and evaluation: the Chatbot Natural Language Understanding Evaluation Corpus and Kaggle's Twitter Sentiment Corpus. Our model shows improved F1-scores and better robustness on the informal/incorrect text found in tweets and on text with Speech-to-Text errors, in both the sentiment and intent classification tasks.
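
To make the described pipeline concrete, below is a minimal PyTorch sketch of the idea: an embedding layer and vanilla transformer encoder produce intermediate representations, a stack of multilayer perceptrons reconstructs (denoises) those token embeddings, and a second transformer encoder operates on the reconstructed embeddings before classification. All module names, layer counts, and sizes (e.g., DenoisingMLPStack, the 768/256/128 dimensions) are illustrative assumptions, not the paper's exact configuration, which builds directly on pretrained BERT.

```python
# Hypothetical sketch of the Stacked DeBERT architecture; names and
# hyperparameters are assumptions, not the paper's implementation.
import torch
import torch.nn as nn

class DenoisingMLPStack(nn.Module):
    """Stack of MLPs that compresses and reconstructs token embeddings,
    forcing more abstract hidden features (denoising-autoencoder style)."""
    def __init__(self, dim=768, hidden=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(dim, 256), nn.ReLU(),
            nn.Linear(256, hidden), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.Linear(hidden, 256), nn.ReLU(),
            nn.Linear(256, dim),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

class StackedDeBERTSketch(nn.Module):
    def __init__(self, vocab_size=30522, dim=768, n_heads=12,
                 n_lower=6, n_upper=6, num_classes=2, max_len=128):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, dim)
        self.pos_emb = nn.Embedding(max_len, dim)
        lower_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=n_heads, batch_first=True)
        upper_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=n_heads, batch_first=True)
        # Vanilla transformers: intermediate input representations.
        self.lower = nn.TransformerEncoder(lower_layer, num_layers=n_lower)
        # MLP stack: reconstructs embeddings of missing/incorrect words.
        self.denoiser = DenoisingMLPStack(dim)
        # "Denoising transformers": encode the reconstructed embeddings.
        self.upper = nn.TransformerEncoder(upper_layer, num_layers=n_upper)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, input_ids):
        pos = torch.arange(input_ids.size(1), device=input_ids.device)
        h = self.tok_emb(input_ids) + self.pos_emb(pos)
        h = self.lower(h)      # intermediate representations
        h = self.denoiser(h)   # richer, reconstructed representations
        h = self.upper(h)
        return self.classifier(h[:, 0])  # classify from the first token

# Usage: a batch of 2 sentences of 16 token ids, 3-way intent classification.
model = StackedDeBERTSketch(num_classes=3)
logits = model(torch.randint(0, 30522, (2, 16)))
print(logits.shape)  # torch.Size([2, 3])
```

In the paper's setup the reconstruction stack would additionally be trained with a reconstruction loss against embeddings of complete sentences; the sketch above shows only the forward data flow.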
