Text Classification for Organizational Researchers

Organizations are increasingly interested in classifying texts, or parts of texts, into categories, because doing so enables more effective use of the information they hold. Manual classification procedures work well for up to a few hundred documents. With larger collections, however, they become laborious, time-consuming, and potentially unreliable. Text mining techniques assign text strings to categories automatically, making classification fast and reliable, which creates potential for their application in organizational research. The purpose of this article is to familiarize organizational researchers with text mining techniques from machine learning and statistics. We describe the text classification process in several roughly sequential steps (training data preparation, preprocessing, transformation, application of classification techniques, and validation) and provide concrete recommendations at each step. To help researchers develop their own text classifiers, the R code associated with each step is presented in a tutorial that draws on our own work on job vacancy mining. We end the article by discussing how researchers can validate a text classification model and the associated output.
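As a rough illustration of how the five steps fit together, the following minimal sketch runs a toy classifier end to end. It is not the article's R tutorial: it uses Python with only the standard library, the example sentences and labels are invented for illustration, and the specific choices (a small stopword list, tf-idf weighting, a nearest-centroid classifier, hold-out accuracy) are assumptions standing in for whichever techniques a researcher would actually select at each step.

```python
import math
import re
from collections import Counter

# Step 1: training data preparation (hypothetical, hand-labeled sentences).
TRAIN = [
    ("Administer medication to patients", "task"),
    ("Monitor and record patient vital signs", "task"),
    ("Assist patients with daily care", "task"),
    ("Bachelor degree in nursing required", "requirement"),
    ("Two years of clinical experience required", "requirement"),
    ("Valid nursing license and degree", "requirement"),
]
# Held-out sentences used only for validation (also hypothetical).
TEST = [
    ("Record medication given to patients", "task"),
    ("Degree and experience required", "requirement"),
]

# Illustrative stopword list; a real project would use a fuller one.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "for", "with"}


def preprocess(text):
    """Step 2: lowercase, keep alphabetic tokens, drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def fit_idf(token_lists):
    """Inverse document frequency, estimated from training documents only."""
    n_docs = len(token_lists)
    doc_freq = Counter(t for tokens in token_lists for t in set(tokens))
    return {t: math.log(n_docs / df) + 1.0 for t, df in doc_freq.items()}


def tfidf(tokens, idf):
    """Step 3: transform tokens into a sparse, L2-normalized tf-idf vector."""
    counts = Counter(t for t in tokens if t in idf)  # ignore unseen terms
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def train_centroids(vectors, labels):
    """Step 4: nearest-centroid classifier; average the vectors per class."""
    sums, sizes = {}, Counter(labels)
    for vec, label in zip(vectors, labels):
        centroid = sums.setdefault(label, Counter())
        for t, w in vec.items():
            centroid[t] += w
    return {lab: {t: w / sizes[lab] for t, w in c.items()}
            for lab, c in sums.items()}


def predict(vec, centroids):
    """Assign the class whose centroid has the largest dot product with vec."""
    def score(label):
        return sum(w * centroids[label].get(t, 0.0) for t, w in vec.items())
    return max(sorted(centroids), key=score)


train_tokens = [preprocess(text) for text, _ in TRAIN]
train_labels = [label for _, label in TRAIN]
idf = fit_idf(train_tokens)
centroids = train_centroids([tfidf(t, idf) for t in train_tokens], train_labels)

# Step 5: validation on the held-out sentences.
predictions = [predict(tfidf(preprocess(text), idf), centroids)
               for text, _ in TEST]
accuracy = sum(p == y for p, (_, y) in zip(predictions, TEST)) / len(TEST)
print(predictions, accuracy)
```

The inverse document frequencies are estimated from the training sentences alone, so the held-out sentences influence neither the transformation nor the classifier. With realistic data, a single hold-out split of this size would be far too small to support any conclusion about model quality.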
