Perspective Identification in Informal Text

This dissertation studies the problem of identifying the ideological perspective of people as expressed in their written text. A person's perspective is often expressed in their stance towards polarizing topics. We are interested in studying how nuanced linguistic cues can be used to identify the perspective of a person in informal genres. Moreover, we are interested in exploring the problem from a multilingual perspective, comparing and contrasting the linguistic devices used in English informal-genre datasets discussing American ideological issues and in Arabic discussion fora posts related to Egyptian politics. In doing so, we solve several challenges. Our first and foremost goal is building computational systems that can successfully identify the perspective from which a given informal text is written, while studying which linguistic cues work best for each language and drawing insights into the similarities and differences between the notion of perspective in the two studied languages. We build computational systems that can successfully identify the stance of a person in English informal text dealing with different topics that are determined by one's perspective, such as the legalization of abortion, the feminist movement, and gay and gun rights; additionally, we are able to identify a more general notion of perspective, namely the 2012 choice of presidential candidate, as well as build systems for automatically identifying different elements of a person's perspective given an Egyptian discussion forum comment. The systems utilize several lexical and semantic features for both languages.
Specifically, for English we explore the use of word sense disambiguation, opinion features, and latent and frame semantics, as well as Linguistic Inquiry and Word Count (LIWC) features; in Arabic, in addition to using sentiment and latent semantics, we study whether linguistic code-switching (LCS) between the standard and dialectal forms of the language can serve as a cue for uncovering the perspective from which a comment was written. This leads us to the challenge of devising computational systems that can handle LCS in Arabic. The Arabic language is diglossic: the standard form of the language, Modern Standard Arabic (MSA), coexists with the regional dialects (DA) that are the native tongue of Arabic speakers in different parts of the Arab world. DA is prevalent in written informal genres, and in most cases it is code-switched with MSA. The presence of code-switching degrades the performance of almost any Natural Language Processing tool trained only on MSA when it is applied to DA or to code-switched MSA-DA content. To address this challenge, we build a state-of-the-art system, AIDA, to computationally handle token-level and sentence-level code-switching. On a conceptual level, for handling and processing Egyptian ideological perspectives, we note the lack of a taxonomy for the most common perspectives among Egyptians and the lack of corresponding annotated corpora. To address this challenge, we develop a taxonomy for the most common community perspectives among Egyptians and use an iterative feedback-loop process to devise guidelines on how to annotate a given online discussion forum post with different elements of a person's perspective. Using the proposed taxonomy and annotation guidelines, we annotate a large set of Egyptian discussion fora posts to identify a comment's perspective as conveyed in the priority expressed by the comment, as well as its stance on major political entities.
