How Short is a Piece of String? : The Impact of Text Length and Text Augmentation on Short-text Classification

Recent increases in the use and availability of short messages have created opportunities to harvest vast amounts of information through machine-based classification. However, traditional classification methods have failed to yield accuracies comparable to classification accuracies on longer texts. Several approaches have previously been employed to extend traditional methods to overcome this problem, including the enhancement of the original texts through the construction of associations with external data supplementation sources. Existing literature does not precisely describe the impact of text length on classification performance. This work quantitatively examines the changes in accuracy of a small selection of classifiers using a variety of enhancement methods, as text length progressively decreases. Findings, based on ANOVA testing at a 95% confidence interval, suggest that the performance of classifiers using simple enhancements decreases with decreasing text length, but that the use of more sophisticated enhancements risks over-supplementation of the text and consequent concept drift and classification performance decrease as text length increases.

[1]  Evgeniy Gabrilovich,et al.  Overcoming the Brittleness Bottleneck using Wikipedia: Enhancing Text Categorization with Encyclopedic Knowledge , 2006, AAAI.

[2]  Hae-Chang Rim,et al.  Some Effective Techniques for Naive Bayes Text Classification , 2006, IEEE Transactions on Knowledge and Data Engineering.

[3]  Peter Norvig,et al.  The Unreasonable Effectiveness of Data , 2009, IEEE Intelligent Systems.

[4]  H. Keselman,et al.  Modern robust data analysis methods: measures of central tendency. , 2003, Psychological methods.

[5]  George A. Miller,et al.  WordNet: A Lexical Database for English , 1995, HLT.

[6]  Somnath Banerjee,et al.  Clustering short texts using wikipedia , 2007, SIGIR.

[7]  Patrick F. Reidy An Introduction to Latent Semantic Analysis , 2009 .

[8]  Nadia Magnenat-Thalmann,et al.  Enhancing naive bayes with various smoothing methods for short text classification , 2012, WWW.

[9]  Aixin Sun,et al.  Short text classification using very few words , 2012, SIGIR '12.

[10]  David R. Karger,et al.  Tackling the Poor Assumptions of Naive Bayes Text Classifiers , 2003, ICML.

[11]  Rich Caruana,et al.  An empirical comparison of supervised learning algorithms , 2006, ICML.

[12]  Danushka Bollegala,et al.  Measuring semantic similarity between words using web search engines , 2007, WWW '07.

[13]  Corinna Cortes,et al.  Support-Vector Networks , 1995, Machine Learning.

[14]  Xiaolin Du,et al.  Short Text Classification: A Survey , 2014, J. Multim..

[15]  Jeffrey V. Nickerson,et al.  Discovering Context: Classifying Tweets through a Semantic Transform Based on Wikipedia , 2011, HCI.

[16]  David D. Lewis,et al.  Naive (Bayes) at Forty: The Independence Assumption in Information Retrieval , 1998, ECML.

[17]  Robert C. Holte,et al.  Very Simple Classification Rules Perform Well on Most Commonly Used Datasets , 1993, Machine Learning.

[18]  Richard A. Harshman,et al.  Indexing by Latent Semantic Analysis , 1990, J. Am. Soc. Inf. Sci..

[19]  Michele Banko,et al.  Scaling to Very Very Large Corpora for Natural Language Disambiguation , 2001, ACL.

[20]  Thorsten Joachims,et al.  Text Categorization with Support Vector Machines: Learning with Many Relevant Features , 1998, ECML.

[21]  Zhoujun Li,et al.  Concept-based Short Text Classification and Ranking , 2014, CIKM.

[22]  Evgeniy Gabrilovich,et al.  Computing Semantic Relatedness Using Wikipedia-based Explicit Semantic Analysis , 2007, IJCAI.