Density based active self-training for cross-lingual sentiment classification

Cross-lingual sentiment classification aims to use annotated sentiment resources in one language (typically English) for sentiment classification in another language. Most existing work relies on automatic machine translation services to project information directly from one language to another. However, because machine translation quality is still far from satisfactory and term distributions may differ across languages, these techniques cannot match the performance of monolingual approaches. To overcome these limitations, we propose a novel learning model that combines active learning and self-training to incorporate unlabeled data from the target language into the learning process. The model also takes the density of the unlabeled data into account so that active learning avoids selecting outliers. We applied the proposed model to book review datasets in two different languages. Experiments showed that it effectively reduces labeling effort compared with several baseline methods.
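One selection round of such a model might look like the following minimal sketch. It is an illustration only, not the procedure from the paper: the abstract does not specify the uncertainty measure, the density measure, or how they are combined, so the entropy score, the average-cosine density, their product, and all function names (`entropy`, `cosine`, `density`, `select`) are assumptions made here.

```python
import math

def entropy(probs):
    """Shannon entropy of a class-probability vector (higher = more uncertain)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def density(i, vectors):
    """Assumed density measure: mean cosine similarity to the other unlabeled items."""
    sims = [cosine(vectors[i], v) for j, v in enumerate(vectors) if j != i]
    return sum(sims) / len(sims) if sims else 0.0

def select(vectors, probs, n_query=1, n_self=1):
    """Split unlabeled items into human queries and self-labeled examples.

    Returns (query_indices, self_labeled), where self_labeled is a list of
    (index, predicted_label) pairs taken from the most confident predictions.
    """
    idx = range(len(vectors))
    # Active learning: uncertain items in dense regions score highest (assumed product rule).
    score = {i: entropy(probs[i]) * density(i, vectors) for i in idx}
    queries = sorted(idx, key=lambda i: score[i], reverse=True)[:n_query]
    # Self-training: the classifier labels its most confident remaining items itself.
    remaining = [i for i in idx if i not in queries]
    confident = sorted(remaining, key=lambda i: max(probs[i]), reverse=True)[:n_self]
    self_labeled = [(i, probs[i].index(max(probs[i]))) for i in confident]
    return queries, self_labeled

# Toy data: items 0 and 2 are equally uncertain, but item 2 is far from the others.
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [1.0, 0.05]]
probs = [[0.5, 0.5], [0.95, 0.05], [0.5, 0.5], [0.8, 0.2]]
queries, self_labeled = select(vectors, probs)
```

In this toy data, item 0 is sent to the human annotator, while the equally uncertain but isolated item 2 is passed over, and item 1 is self-labeled with class 0. In a full loop, both sets would be added to the training data and the classifier retrained before the next round.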
