Monotone submodular subset for sentiment analysis of online reviews

As online social media has grown, the volume of user-generated reviews has increased dramatically. Such text-based user-generated content is useful for estimating public sentiment. Many sentiment analysis studies assume that the sentiment expressed in online reviews can be retrieved from general text features. However, text redundancy and sheer quantity can degrade analysis performance, especially under strict corpus size constraints. This paper proposes a sentiment subset selection framework (SentiSS) that constructs a small set of documents from the original corpus to convey a subjective representation of it. The framework filters out irrelevant sentiment information using topic modeling and selects subsets by submodular maximization subject to a cardinality constraint. Compared with the conventional submodular-based score function, our proposed score function helps the framework capture fine-grained sentiment features expressed in reviews. We empirically analyze the efficacy of SentiSS on different context domains. We also compare the subset's impact on the metrics at different sentiment levels, namely positive, neutral, and negative. Experimental results show that SentiSS can compress the sentiment corpus while maintaining the classifier's performance on these metrics.
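To illustrate subset selection by submodular maximization under a cardinality constraint, the following is a minimal sketch of the standard greedy algorithm, not the SentiSS implementation. The score function `coverage_score` (the number of distinct terms covered by the selected documents) is an assumed stand-in for the paper's sentiment-aware score function, and the names `greedy_select`, `reviews`, and `k` are hypothetical. For a monotone submodular score, this greedy procedure is known to reach at least a (1 - 1/e) fraction of the optimal value.

```python
from typing import Callable, FrozenSet, List, Sequence, Set


def coverage_score(selected: Set[int], docs: Sequence[FrozenSet[str]]) -> float:
    """Assumed stand-in score: count of distinct terms covered by the selection.

    Set coverage is monotone and submodular; it is NOT the paper's score function.
    """
    covered: Set[str] = set()
    for i in selected:
        covered |= docs[i]
    return float(len(covered))


def greedy_select(
    docs: Sequence[FrozenSet[str]],
    k: int,
    score: Callable[[Set[int], Sequence[FrozenSet[str]]], float] = coverage_score,
) -> List[int]:
    """Pick at most k document indices by repeatedly adding the best marginal gain."""
    selected: List[int] = []
    chosen: Set[int] = set()
    current = score(chosen, docs)
    for _ in range(min(k, len(docs))):
        best_gain, best_idx = 0.0, None
        for i in range(len(docs)):
            if i in chosen:
                continue
            # Marginal gain of adding document i to the current selection.
            gain = score(chosen | {i}, docs) - current
            if gain > best_gain:
                best_gain, best_idx = gain, i
        if best_idx is None:
            # No remaining document improves the score, so stop early.
            break
        chosen.add(best_idx)
        selected.append(best_idx)
        current += best_gain
    return selected


# Toy corpus: each review is reduced to a set of terms (hypothetical data).
reviews = [
    frozenset({"great", "battery", "screen"}),
    frozenset({"great", "battery"}),
    frozenset({"poor", "support"}),
    frozenset({"screen", "average"}),
]
subset = greedy_select(reviews, k=2)
print(subset)  # [0, 2]
```

In this toy run the second review is skipped because it adds no terms beyond the first, which is the redundancy-avoiding behavior that diminishing returns provide. Swapping in a different monotone submodular score only requires passing another function as `score`.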
