Cross-lingual Data Transformation and Combination for Text Classification

Text classification is a fundamental task in text data mining. Training a model that generalizes well requires a large volume of text, and when data in one language is insufficient, cross-lingual data may be needed to supplement it. Cross-lingual data sources can, however, suffer from data incompatibility, because text written in different languages can exhibit distinct word sequences and semantic patterns. Machine translation and word embedding alignment offer two ways to transform and combine data for cross-lingual training. To the best of our knowledge, little work has evaluated how the methodology used for semantic space transformation and data combination affects the performance of classification models trained on cross-lingual resources. In this paper, we systematically evaluate two commonly used text classifiers, a CNN (Convolutional Neural Network) and an RNN (Recurrent Neural Network), under different data transformation and combination strategies. Monolingual models were trained on English and on French, alongside their translated and aligned embeddings. Our results suggest that semantic space transformation can improve the performance of monolingual models under certain conditions. Bilingual models were trained on a combination of English and French. Our results indicate that a cross-lingual classification model can benefit significantly from cross-lingual data by learning from translated or aligned embedding spaces.
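As an illustration of what word embedding alignment involves, the following is a minimal sketch that maps one embedding space onto another with an orthogonal (Procrustes) transformation and then stacks the two spaces into one training matrix. The abstract does not state which alignment method was used, so the choice of orthogonal Procrustes, the toy random vectors, and all names (`learn_orthogonal_map`, `en`, `fr`) are assumptions made for illustration only.

```python
import numpy as np


def learn_orthogonal_map(src, tgt):
    """Return the orthogonal W minimizing ||src @ W - tgt||_F (Procrustes).

    src and tgt hold vectors for the same dictionary entries, row by row.
    """
    u, _, vt = np.linalg.svd(src.T @ tgt)
    return u @ vt


rng = np.random.default_rng(0)
dim = 4

# Toy stand-ins for pretrained vectors of 6 translation pairs (hypothetical data).
en = rng.normal(size=(6, dim))

# Simulate a "French" space as a hidden rotation of the English one.
hidden_rotation, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
fr = en @ hidden_rotation.T

# Learn a map from the French space into the English space.
w = learn_orthogonal_map(fr, en)
aligned_fr = fr @ w

# Combine both languages in one shared space, e.g. as classifier input.
combined = np.vstack([en, aligned_fr])
```

In this toy setup the French vectors are an exact rotation of the English ones, so the learned map recovers the alignment exactly; real embedding spaces are only approximately related, and the aligned vectors would match their translations only roughly.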
