Genre and Domain Dependencies in Sentiment Analysis

Genre and domain influence an author’s style of writing and therefore a text’s characteristics. Natural language processing is sensitive to such variations in textual characteristics: it is said to be genre and domain dependent. This thesis investigates genre and domain dependencies in sentiment analysis. Its goal is to support the development of robust sentiment analysis approaches that work well and in a predictable manner under different conditions, i.e. for different genres and domains.
Initially, we show that a prototypical approach to sentiment analysis, viz. a supervised machine learning model based on word n-gram features, performs differently on gold standards that originate from differing genres and domains, but performs similarly on gold standards that originate from similar genres and domains. We show that these gold standards differ in certain textual characteristics, viz. their domain complexity. We find a strong linear relation between our approach’s accuracy on a particular gold standard and that gold standard’s domain complexity, and we use this relation to estimate our approach’s accuracy.
Subsequently, we use certain textual characteristics (domain complexity, domain similarity, and readability) in a variety of applications. Domain complexity and domain similarity measures are used to determine parameter settings in two tasks. Domain complexity guides us in model selection for in-domain polarity classification, viz. in decisions regarding word n-gram model order and word n-gram feature selection. Domain complexity and domain similarity guide us in domain adaptation. We propose a novel domain adaptation scheme and apply it to cross-domain polarity classification in semi- and unsupervised domain adaptation scenarios. Readability is used for feature engineering. We propose to adopt readability gradings, readability indicators, and word and syntax distributions as features for subjectivity classification.
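
The abstract does not state which domain similarity measure is used. As an illustration only, the sketch below assumes one common choice: the Jensen-Shannon divergence between the unigram distributions of two corpora, where a lower value indicates more similar domains. The function names and the two toy token lists are hypothetical and are not taken from the thesis.

```python
import math
from collections import Counter


def unigram_distribution(tokens):
    """Relative frequency of each token in a corpus (hypothetical helper)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in bits.

    Assumes q[w] > 0 wherever p[w] > 0, which holds for the mixture used below.
    """
    return sum(pw * math.log2(pw / q[w]) for w, pw in p.items() if pw > 0)


def jensen_shannon_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint distributions."""
    vocabulary = set(p) | set(q)
    # Mixture distribution m = (p + q) / 2 over the joint vocabulary.
    m = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in vocabulary}
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Toy, made-up corpora standing in for two domains (assumption for illustration).
book_tokens = "the plot was gripping and the characters were great".split()
kitchen_tokens = "the blender was loud but the blades were great".split()

similarity_score = jensen_shannon_divergence(
    unigram_distribution(book_tokens),
    unigram_distribution(kitchen_tokens),
)
print(f"Jensen-Shannon divergence: {similarity_score:.3f}")
```

With base-2 logarithms the score is bounded between 0 and 1, which makes values comparable across domain pairs. A real setup would use much larger corpora and would likely need smoothing or vocabulary filtering, neither of which is shown here.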
Moreover, we generalize a framework for modeling and representing negation in machine learning-based sentiment analysis. This framework is applied to in-domain and cross-domain polarity classification. We investigate the relation
