Lexicon Induction for Interpretable Text Classification

The automated classification of text documents is an active research challenge in document-oriented information systems, with applications such as helping users browse massive amounts of data, detecting likely authors of unsigned work, and analyzing large corpora along predefined dimensions of interest such as sentiment or emotion. Existing approaches to text classification tend to rely on black-box algorithms, which classify accurately but do not expose the rationale behind each prediction. Lexicon-based classifiers offer an alternative by modeling the classification problem with a trivially interpretable classifier. However, current techniques for lexicon-based document classification are limited to either handcrafted lexicons, which suffer from human bias and are difficult to extend, or automatically generated lexicons, which are induced using point estimates of some predefined probabilistic measure on the corpus of interest. This paper proposes LexicNet, an alternative way of generating high-accuracy classification lexicons that offers optimal generalization power without sacrificing model interpretability. We evaluate our approach on two tasks: stance detection and sentiment classification. We find that our lexicon outperforms baseline lexicon induction approaches as well as several standard text classifiers.
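
To make concrete what "trivially interpretable" means for a lexicon-based classifier, the following is a minimal sketch, not the LexicNet method itself. It assumes a binary task, a lexicon that maps each word to a signed weight, and a decision rule that sums the weights of the words found in the document. The lexicon entries, the function names, and the zero threshold are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical toy lexicon: word -> signed weight.
# Positive weights favor the "positive" class, negative weights the "negative" class.
TOY_LEXICON = {"great": 1.5, "good": 1.0, "bad": -1.0, "awful": -2.0}


def score_document(text, lexicon):
    """Return the summed lexicon score and the per-word contributions."""
    tokens = re.findall(r"[a-z']+", text.lower())
    # Keep only the tokens that appear in the lexicon, with their weights.
    contributions = [(tok, lexicon[tok]) for tok in tokens if tok in lexicon]
    total = sum(weight for _, weight in contributions)
    return total, contributions


def classify(text, lexicon, threshold=0.0):
    """Label a document by comparing its summed score to a threshold."""
    total, contributions = score_document(text, lexicon)
    label = "positive" if total > threshold else "negative"
    return label, total, contributions


if __name__ == "__main__":
    label, total, why = classify("A great movie with a good cast", TOY_LEXICON)
    print(label, total, why)
```

Because the prediction is just a sum of word weights, the list of contributions returned alongside the label is itself the explanation of the decision. How the weights are induced from a corpus is the part the paper addresses, and it is not covered by this sketch.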
