Paper Information - Lexicon Induction for Interpretable Text Classification

Lexicon Induction for Interpretable Text Classification

The automated classification of text documents is an active research challenge in document-oriented information systems, helping users browse massive amounts of data, detect likely authors of unsigned work, or analyze large corpora along predefined dimensions of interest such as sentiment or emotion. Existing approaches to text classification tend toward building black-box algorithms, which offer accurate classification at the price of hiding the rationale behind each prediction. Lexicon-based classifiers offer an alternative by modeling the classification problem with a trivially interpretable classifier. However, current techniques for lexicon-based document classification limit themselves to using either handcrafted lexicons, which suffer from human bias and are difficult to extend, or automatically generated lexicons, which are induced using point estimates of some predefined probabilistic measure in the corpus of interest. This paper proposes LexicNet, an alternative way of generating high-accuracy classification lexicons that offers optimal generalization power without sacrificing model interpretability. We evaluate our approach on two tasks: stance detection and sentiment classification. We find that our lexicon outperforms baseline lexicon induction approaches as well as several standard text classifiers.
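To make the baseline idea concrete, here is a minimal sketch of a lexicon induced from point estimates and then used as an interpretable classifier. It is not LexicNet; it assumes a smoothed pointwise mutual information score, log P(term | class) / P(term), as the "predefined probabilistic measure", and whitespace tokenization. The function names `induce_lexicon` and `classify` and the `smoothing` parameter are hypothetical, chosen only for illustration.

```python
import math
from collections import Counter, defaultdict


def induce_lexicon(docs, labels, smoothing=1.0):
    """Build one term-weight table per class from corpus counts.

    Each weight is a smoothed point estimate of
    log( P(term | class) / P(term) ), i.e. pointwise mutual information.
    This scoring choice is an assumption made for this sketch.
    """
    class_counts = defaultdict(Counter)  # term frequencies per class
    total_counts = Counter()             # term frequencies over the corpus
    for text, label in zip(docs, labels):
        tokens = text.lower().split()    # naive whitespace tokenizer
        class_counts[label].update(tokens)
        total_counts.update(tokens)

    vocab_size = len(total_counts)
    total = sum(total_counts.values())
    lexicon = {}
    for label, counts in class_counts.items():
        class_total = sum(counts.values())
        weights = {}
        for term in total_counts:
            p_term_given_class = (counts[term] + smoothing) / (
                class_total + smoothing * vocab_size
            )
            p_term = (total_counts[term] + smoothing) / (
                total + smoothing * vocab_size
            )
            weights[term] = math.log(p_term_given_class / p_term)
        lexicon[label] = weights
    return lexicon


def classify(text, lexicon):
    """Sum the lexicon weights of the tokens and pick the best class.

    The per-class scores are returned as well, so each prediction can be
    traced back to the individual term weights that produced it.
    """
    tokens = text.lower().split()
    scores = {
        label: sum(weights.get(token, 0.0) for token in tokens)
        for label, weights in lexicon.items()
    }
    return max(scores, key=scores.get), scores
```

Because every prediction is a sum of per-term weights, inspecting `lexicon[label]` shows exactly which words pushed a document toward a class. The weights are fixed by corpus counts alone, which is the point-estimate limitation the abstract describes.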

Nirmalie Wiratunga | Jérémie Clos
