Paper information - How Short is a Piece of String? : The Impact of Text Length and Text Augmentation on Short-text Classification


The growing use and availability of short messages has created opportunities to harvest vast amounts of information through machine-based classification. However, traditional classification methods have failed to yield accuracies on short texts comparable to those achieved on longer texts. Several approaches have been employed to extend traditional methods to overcome this problem, including enhancing the original texts by constructing associations with external supplementary data sources. Existing literature does not precisely describe the impact of text length on classification performance. This work quantitatively examines the changes in accuracy of a small selection of classifiers, using a variety of enhancement methods, as text length progressively decreases. Findings, based on ANOVA testing at a 95% confidence level, suggest that the performance of classifiers using simple enhancements decreases with decreasing text length, but that more sophisticated enhancements risk over-supplementing the text, with consequent concept drift and a decrease in classification performance as text length increases.
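The kind of analysis the abstract describes, truncating texts to progressively shorter lengths and comparing classifier accuracy across lengths with ANOVA, can be sketched as follows. This is a minimal illustration and not the authors' code: the helper names `truncate_words` and `one_way_anova_f`, the word limits, and the accuracy values are all hypothetical placeholders invented for the example, not results from the paper.

```python
from statistics import mean


def truncate_words(text, max_words):
    """Keep only the first max_words whitespace-separated tokens (hypothetical helper)."""
    return " ".join(text.split()[:max_words])


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA over lists of numbers."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Placeholder per-fold accuracies for one classifier at three word limits.
# These numbers are invented for illustration only.
accuracy_by_length = {
    20: [0.81, 0.79, 0.80],
    10: [0.74, 0.76, 0.75],
    5: [0.62, 0.65, 0.63],
}

f_stat, df_b, df_w = one_way_anova_f(list(accuracy_by_length.values()))
print(truncate_words("how short is a piece of string", 4))
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```

To decide significance at the 5% level, the F statistic would be compared against the critical value of the F distribution for the given degrees of freedom. A statistics library such as SciPy (`scipy.stats.f_oneway`) returns the p-value directly.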

Luca Longo | Svetlana Hensman | Austin McCartney
