A machine learning approach to sentiment analysis in multilingual Web texts

Sentiment analysis, also called opinion mining, is a form of information extraction from text of growing research and commercial interest. In this paper we present our machine learning experiments on sentiment analysis in blog, review and forum texts found on the World Wide Web and written in English, Dutch and French. We train on a set of example sentences or statements that are manually annotated as positive, negative or neutral with regard to a certain entity. We are interested in the feelings that people express with regard to certain consumption products. We learn and evaluate several classification models that can be configured in a cascaded pipeline. We have to deal with several problems: the noisy character of the input texts, the attribution of the sentiment to a particular entity, and the small size of the training set. We succeed in identifying positive, negative and neutral feelings toward the entity under consideration with ca. 83% accuracy for English texts, based on unigram features augmented with linguistic features. The accuracy for the Dutch and French texts is ca. 70% and 68%, respectively, because these texts show a larger variety of linguistic expressions that more often diverge from standard language and thus demand more training patterns. In addition, our experiments give us insights into the portability of the learned models across domains and languages. A substantial part of the article investigates the role of active learning techniques for reducing the number of examples to be manually annotated.
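To make the active learning idea concrete, the following is a minimal sketch and not the classifiers or selection strategies evaluated in this article. It assumes a small multinomial Naive Bayes model over unigram counts and a margin-based uncertainty criterion; the function names (`train_nb`, `posteriors`, `most_uncertain`) and the whitespace tokenizer are hypothetical choices made for illustration.

```python
import math
from collections import Counter, defaultdict

LABELS = ("positive", "negative", "neutral")

def tokenize(text):
    # Assumption: lowercased whitespace tokens stand in for unigram features.
    return text.lower().split()

def train_nb(examples):
    """Count unigrams per class from (text, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        class_counts[label] += 1
        for tok in tokenize(text):
            word_counts[label][tok] += 1
            vocab.add(tok)
    return class_counts, word_counts, vocab

def posteriors(model, text):
    """Return normalized class probabilities for one text."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    log_scores = {}
    for label in LABELS:
        # Add-one smoothing on the class prior and on the word likelihoods.
        lp = math.log((class_counts[label] + 1) / (total + len(LABELS)))
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in tokenize(text):
            if tok in vocab:  # tokens never seen in training are ignored
                lp += math.log((word_counts[label][tok] + 1) / denom)
        log_scores[label] = lp
    # Normalize in a numerically stable way.
    m = max(log_scores.values())
    exp = {k: math.exp(v - m) for k, v in log_scores.items()}
    z = sum(exp.values())
    return {k: v / z for k, v in exp.items()}

def most_uncertain(model, pool):
    """Pick the unlabeled text with the smallest top-two probability margin."""
    def margin(text):
        top = sorted(posteriors(model, text).values(), reverse=True)
        return top[0] - top[1]
    return min(pool, key=margin)
```

In a pool-based loop, the text returned by `most_uncertain` would be handed to a human annotator, added to the labeled set, and the model retrained, so that annotation effort goes to the examples the current model finds hardest to classify.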
