A Case Study and Qualitative Analysis of Simple Cross-Lingual Opinion Mining

User-generated content from social media is produced in many languages, making it technically challenging to compare the themes discussed in one domain across different cultures and regions. Such comparisons matter in globalized domains such as market research, where people from different nations and markets may have different requirements for a product. We propose a simple, modern, and effective method for building a single topic model with sentiment analysis capable of covering multiple languages simultaneously, based on a pre-trained state-of-the-art deep neural network for natural language understanding. To demonstrate its feasibility, we apply the model to newspaper articles and user comments from a specific domain, namely organic food products and related consumption behavior. The themes match across languages. Additionally, we obtain a high proportion of stable and domain-relevant topics, a meaningful relation between topics and their respective textual contents, and an interpretable representation for social media documents. Marketing can potentially benefit from our method, since it provides an easy-to-use means of addressing specific customer interests from different market regions around the globe. For reproducibility, we provide the code, data, and results of our study at https://github.com/apairofbigwings/cross-lingual-opinion-mining.
