Data-based prediction of sentiments using heterogeneous model ensembles

In this paper, we present an ensemble modeling approach for sentiment analysis using machine learning algorithms. The main goal of sentiment analysis is to develop estimators that identify the sentiment orientation (positive, negative, or neutral) of sentences from arbitrary sources. The novel approach presented here relies on analyzing the words found in sentences and on forming large sets of heterogeneous models, i.e., binary as well as multi-class classification models trained with a variety of machine learning methods; these models represent the relationship between the presence of given words (or combinations of words) and sentiments. All models trained during the learning phase are applied during the test phase, and the final sentiment assessment is annotated with a confidence value that specifies how reliable the models are regarding the presented decision. In the empirical part of this paper, we show results achieved using a German corpus of Amazon reviews and a set of machine learning methods (decision trees and adaptive boosting, Gaussian processes, random forests, k-nearest neighbor classification, support vector machines and artificial neural networks with evolutionary feature and parameter optimization, and genetic programming). Using a heterogeneous model ensemble learning approach that combines multi-class classifiers with binary classifiers, the classification accuracy can be increased significantly, and the ratio of totally wrongly classified samples (i.e., those assigned to the completely opposite sentiment orientation) can be decreased significantly.
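The combination step described above can be illustrated with a minimal sketch. It rests on assumptions that are not taken from the paper: every model casts one vote per sentence, a binary model abstains when its class does not apply, and the confidence value is the share of cast votes that agree with the winning label. The helpers `keyword_binary`, `lexicon_multiclass`, `ensemble_predict`, and `opposite_error_ratio` are hypothetical names, and the toy keyword models merely stand in for the trained classifiers listed in the abstract.

```python
from collections import Counter
from typing import Callable, Optional, Sequence, Tuple

Label = str  # "positive", "neutral", or "negative"
# A model maps a sentence to a label, or to None when it abstains.
Model = Callable[[str], Optional[Label]]


def keyword_binary(word: str, label: Label) -> Model:
    """Hypothetical binary model: votes `label` if `word` occurs, else abstains."""
    return lambda sentence: label if word in sentence.lower().split() else None


def lexicon_multiclass(sentence: str) -> Label:
    """Hypothetical multi-class model based on a tiny German word list."""
    words = set(sentence.lower().split())
    score = len(words & {"gut", "super"}) - len(words & {"schlecht", "kaputt"})
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def ensemble_predict(models: Sequence[Model], sentence: str) -> Tuple[Label, float]:
    """Majority vote over all non-abstaining models.

    Confidence is the fraction of cast votes that agree with the winner
    (an assumed definition, not the paper's).
    """
    votes = Counter()
    for model in models:
        vote = model(sentence)
        if vote is not None:
            votes[vote] += 1
    if not votes:
        return "neutral", 0.0  # no model voted: fall back with zero confidence
    label, count = votes.most_common(1)[0]
    return label, count / sum(votes.values())


def opposite_error_ratio(y_true: Sequence[Label], y_pred: Sequence[Label]) -> float:
    """Fraction of samples assigned the completely opposite sentiment."""
    opposite = {("positive", "negative"), ("negative", "positive")}
    wrong = sum((t, p) in opposite for t, p in zip(y_true, y_pred))
    return wrong / len(y_true)


models = [
    lexicon_multiclass,
    keyword_binary("gut", "positive"),
    keyword_binary("schlecht", "negative"),
]

if __name__ == "__main__":
    print(ensemble_predict(models, "Das Produkt ist gut"))
```

A low confidence value flags sentences on which the models disagree. `opposite_error_ratio` counts only positive/negative swaps, so confusing a neutral sentence with a polar one does not count as a totally wrong classification.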
