Supervised term weighting for sentiment analysis

Vector space text classification is commonly used in intelligence applications such as email and conversation analysis. In this paper we propose a supervised term weighting scheme called tƒ × KL (term frequency Kullback-Leibler), which weights each word in proportion to the ratio of its document frequencies in the positive and negative classes. We then generalize tƒ × KL to handle class imbalance, which is very common in real-world intelligence analysis. The generalized tƒ × KL weights each word according to the ratio of its class-conditioned word probabilities in the positive and negative classes, rather than the raw document frequencies. Results on four classification datasets show that tƒ × KL performs consistently better than the baseline tƒ × idƒ and four other supervised term weighting schemes, including the recently proposed tƒ × rƒ (term frequency relevance frequency). The generalized tƒ × KL proved highly robust to skewed class distributions, beating the second runner-up by more than 20% on a dataset with only 10% positive training examples. The generalized tƒ × KL is thus an effective and robust term weighting scheme that can significantly improve binary classification performance in sentiment analysis and intelligence applications.
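
The class-ratio weighting described above can be illustrated with a minimal sketch. The abstract does not give the exact tƒ × KL formula, so everything specific here is an assumption made for illustration: the logarithm of the ratio, the add-one smoothing, the estimate of a class-conditioned word probability as the fraction of a class's documents that contain the word, and the function names `doc_freq`, `class_ratio_weights`, and `weight_document`.

```python
import math
from collections import Counter


def doc_freq(docs):
    """Count, for each word, how many documents contain it."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # each document counts a word at most once
    return df


def class_ratio_weights(pos_docs, neg_docs, smoothing=1.0):
    """Per-word weight from the ratio of class-conditioned probabilities.

    Assumed form (not taken from the paper): log(P(w | pos) / P(w | neg)),
    where P(w | class) is the smoothed fraction of that class's documents
    containing w. Dividing by the class size is what keeps the ratio
    comparable when one class is much smaller than the other.
    """
    pos_df, neg_df = doc_freq(pos_docs), doc_freq(neg_docs)
    n_pos, n_neg = len(pos_docs), len(neg_docs)
    weights = {}
    for word in set(pos_df) | set(neg_df):
        p_pos = (pos_df[word] + smoothing) / (n_pos + 2 * smoothing)
        p_neg = (neg_df[word] + smoothing) / (n_neg + 2 * smoothing)
        weights[word] = math.log(p_pos / p_neg)
    return weights


def weight_document(doc, weights):
    """Multiply each word's term frequency by its class-ratio weight."""
    tf = Counter(doc)
    return {word: tf[word] * weights.get(word, 0.0) for word in tf}


# Toy, imbalanced corpus: 2 positive documents and 4 negative ones.
pos_docs = [["good", "film"], ["good", "plot"]]
neg_docs = [["bad", "film"], ["bad", "film"], ["bad", "film"], ["bad", "film"]]
weights = class_ratio_weights(pos_docs, neg_docs)
vector = weight_document(["good", "good", "film"], weights)
```

In this toy corpus a word seen mostly in positive documents receives a positive weight, a word seen mostly in negative documents receives a negative weight, and a word spread across both classes stays close to zero. Using raw document counts instead of per-class fractions would bias the ratio toward the larger class, which is the problem the generalized scheme is said to address.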