Instance Selection Improves Cross-Lingual Model Training for Fine-Grained Sentiment Analysis

Scarcity of annotated corpora for many languages is a bottleneck for training fine-grained sentiment analysis models that can tag aspects and subjective phrases. We propose to exploit statistical machine translation to alleviate the need for training data by projecting annotated data in a source language to a target language so that a supervised fine-grained sentiment analysis system can be trained. To avoid the negative influence of poor-quality translations, we propose a filtering approach based on machine translation quality estimation measures that selects only high-quality sentence pairs for projection. We evaluate on the language pair German/English, using a corpus of product reviews annotated for both languages, and compare to in-target-language training. Projection without any filtering leads to 23% F1 in the task of detecting aspect phrases, compared to 41% F1 for in-target-language training. Our approach obtains up to 47% F1. Further, we show that, even without filtering, the detection of subjective phrases is competitive with in-target-language training.
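The filtering and projection idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the `SentencePair` structure, the `quality` field (standing in for any quality estimation score, with higher meaning better), the fixed `threshold`, and the min/max span projection over word alignments are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Span = Tuple[int, int, str]  # (start, end exclusive, label)


@dataclass
class SentencePair:
    """Hypothetical container for one translated, word-aligned sentence."""
    source_tokens: List[str]
    target_tokens: List[str]
    alignment: List[Tuple[int, int]]  # (source index, target index)
    source_spans: List[Span]          # annotations on the source side
    quality: float                    # assumed quality score, higher is better


def project_spans(pair: SentencePair) -> List[Span]:
    """Map each source span to the smallest target span covering its aligned tokens."""
    src_to_tgt: Dict[int, List[int]] = {}
    for s, t in pair.alignment:
        src_to_tgt.setdefault(s, []).append(t)

    projected: List[Span] = []
    for start, end, label in pair.source_spans:
        targets = [t for s in range(start, end) for t in src_to_tgt.get(s, [])]
        if targets:  # spans with no aligned target token are dropped
            projected.append((min(targets), max(targets) + 1, label))
    return projected


def select_and_project(pairs: List[SentencePair], threshold: float):
    """Keep only pairs whose quality score reaches the threshold, then project."""
    return [
        (p.target_tokens, project_spans(p))
        for p in pairs
        if p.quality >= threshold
    ]


# Toy data: one well-translated pair and one poorly translated pair.
pairs = [
    SentencePair(
        source_tokens=["die", "Kamera", "ist", "toll"],
        target_tokens=["the", "camera", "is", "great"],
        alignment=[(0, 0), (1, 1), (2, 2), (3, 3)],
        source_spans=[(1, 2, "aspect"), (3, 4, "subjective")],
        quality=0.9,
    ),
    SentencePair(
        source_tokens=["der", "Akku", "hält", "lange"],
        target_tokens=["the", "battery", "long"],
        alignment=[(0, 0), (1, 1), (3, 2)],
        source_spans=[(1, 2, "aspect")],
        quality=0.2,
    ),
]

training_data = select_and_project(pairs, threshold=0.5)
```

With the toy data, only the first pair passes the threshold, so the projected training set contains a single target sentence carrying one aspect span and one subjective span. In practice the threshold (or the fraction of instances kept) would be tuned rather than fixed.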
