The influence of preprocessing on text classification using a bag-of-words representation

Text classification (TC) is the task of automatically assigning documents to a fixed number of categories. TC is an important component of many text applications, and many of these applications perform preprocessing. There are different types of text preprocessing, e.g., conversion of uppercase letters into lowercase letters, HTML tag removal, stopword removal, punctuation mark removal, lemmatization, correction of commonly misspelled words, and reduction of replicated characters. We hypothesize that applying different combinations of preprocessing methods can improve TC results. Therefore, we performed an extensive and systematic set of TC experiments (this is our main research contribution) to explore the impact of all possible combinations of five/six basic preprocessing methods on four benchmark text corpora (the full corpora, not samples of them) using three machine learning (ML) methods and training and test sets. The general conclusion (at least for the datasets examined) is that it is always advisable to run an extensive and systematic set of TC experiments over combinations of preprocessing methods, because doing so helps improve TC accuracy. For all the tested datasets, there was always at least one combination of basic preprocessing methods that could be recommended to significantly improve TC accuracy using a bag-of-words (BOW) representation. For three datasets, stopword removal was the only single preprocessing method that enabled a significant improvement over the baseline result, which used a bag of 1,000 word unigrams. For some of the datasets, removing HTML tags, correcting spelling, removing punctuation marks, and reducing replicated characters yielded only minimal improvement. For the fourth dataset, however, stopword removal was not beneficial. Instead, the conversion of uppercase letters into lowercase letters was the only single preprocessing method that demonstrated a significant improvement over the baseline result. The best result for this dataset was obtained when we performed spelling correction and conversion into lowercase letters.
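To make the experimental design concrete, the following is a minimal sketch of how all combinations of basic preprocessing methods could be enumerated and applied before building a bag of word unigrams. It is not the authors' pipeline: the function names, the single-letter method codes, the tiny stopword list, and the fixed application order are all assumptions made for illustration, and lemmatization and spelling correction are omitted because they require external resources.

```python
import itertools
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not the list used in the paper).
STOPWORDS = {"the", "a", "an", "of", "and", "is", "to", "in"}

def remove_html(text):
    # Replace anything that looks like an HTML tag with a space.
    return re.sub(r"<[^>]+>", " ", text)

def lowercase(text):
    return text.lower()

def remove_punct(text):
    # Replace every character that is neither a word character nor whitespace.
    return re.sub(r"[^\w\s]", " ", text)

def reduce_repeats(text):
    # Collapse runs of three or more identical characters to two.
    return re.sub(r"(.)\1{2,}", r"\1\1", text)

def remove_stopwords(text):
    return " ".join(w for w in text.split() if w.lower() not in STOPWORDS)

# Hypothetical method codes; dict order fixes the order of application
# (HTML tags are removed before punctuation so that tags are still recognizable).
METHODS = {
    "H": remove_html,
    "L": lowercase,
    "P": remove_punct,
    "R": reduce_repeats,
    "S": remove_stopwords,
}

def all_combinations(names):
    # Yield every subset of the methods, including the empty baseline.
    for r in range(len(names) + 1):
        yield from itertools.combinations(names, r)

def preprocess(text, combo):
    for name in combo:
        text = METHODS[name](text)
    return text

def bow_vocabulary(docs, size=1000):
    # Keep the `size` most frequent word unigrams across the given documents.
    counts = Counter(w for d in docs for w in d.split())
    return [w for w, _ in counts.most_common(size)]

def vectorize(doc, vocab):
    # Represent a document as unigram counts over the vocabulary.
    counts = Counter(doc.split())
    return [counts[w] for w in vocab]

if __name__ == "__main__":
    combos = list(all_combinations(list(METHODS)))
    print(len(combos), "combinations, including the no-preprocessing baseline")
    print(preprocess("<b>The</b> movie was GREAAAT!!!", ("H", "L", "P", "R", "S")))
```

In an actual experiment, each combination would be applied to the training and test documents, the vocabulary would be built from the training set only, and a classifier would be trained and evaluated once per combination so that the results can be compared against the baseline.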
