Text Classification for Organizational Researchers

Organizations are increasingly interested in classifying texts, or parts of texts, into categories, because doing so enables more effective use of the information they hold. Manual text classification works well for up to a few hundred documents. For larger collections, however, manual procedures become laborious, time-consuming, and potentially unreliable. Text mining techniques automate the assignment of text strings to categories, making classification efficient and reliable and creating potential for its application in organizational research. The purpose of this article is to familiarize organizational researchers with text mining techniques from machine learning and statistics. We describe the text classification process as a series of roughly sequential steps (training data preparation, preprocessing, transformation, application of classification techniques, and validation) and provide concrete recommendations at each step. To help researchers develop their own text classifiers, a tutorial presents the R code associated with each step; the tutorial draws on our own work on job vacancy mining. We end the article by discussing how researchers can validate a text classification model and its output.
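To make the five steps concrete, the following is a minimal sketch of how they might fit together. It is not the article's tutorial code, which is written in R; it uses only the Python standard library. The example sentences, the labels "task" and "requirement", the stopword list, and all function names are hypothetical, and multinomial naive Bayes with add-one smoothing is assumed here as just one of many possible classification techniques.

```python
import math
import re
from collections import Counter, defaultdict

# Step 1: training data preparation (hypothetical hand-labeled sentences).
TRAIN = [
    ("Administer medication to patients", "task"),
    ("Monitor patients and record vital signs", "task"),
    ("Assist patients with daily care", "task"),
    ("Bachelor degree in nursing required", "requirement"),
    ("Two years of experience required", "requirement"),
    ("Valid nursing license and degree", "requirement"),
]
# Held-out sentences used only for validation (also hypothetical).
TEST = [
    ("Record medication given to patients", "task"),
    ("Nursing degree and experience required", "requirement"),
]

STOPWORDS = {"a", "an", "and", "in", "of", "the", "to", "with"}


def preprocess(text):
    """Step 2: lowercase, tokenize on letters, and remove stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def transform(text):
    """Step 3: represent a text as a bag of words (term -> count)."""
    return Counter(preprocess(text))


def train(examples):
    """Step 4a: collect the counts needed by multinomial naive Bayes."""
    class_docs = Counter()
    term_counts = defaultdict(Counter)
    for text, label in examples:
        class_docs[label] += 1
        term_counts[label].update(transform(text))
    vocab = {t for counts in term_counts.values() for t in counts}
    return class_docs, term_counts, vocab


def predict(model, text):
    """Step 4b: return the label with the highest log posterior."""
    class_docs, term_counts, vocab = model
    n_docs = sum(class_docs.values())
    scores = {}
    for label in class_docs:
        total = sum(term_counts[label].values())
        score = math.log(class_docs[label] / n_docs)  # log prior
        for term, count in transform(text).items():
            if term in vocab:  # terms unseen in training are ignored
                # Add-one (Laplace) smoothed term likelihood.
                p = (term_counts[label][term] + 1) / (total + len(vocab))
                score += count * math.log(p)
        scores[label] = score
    return max(scores, key=scores.get)


def holdout_accuracy(model, examples):
    """Step 5: validate on held-out data via the share of correct labels."""
    correct = sum(predict(model, text) == label for text, label in examples)
    return correct / len(examples)


if __name__ == "__main__":
    model = train(TRAIN)
    print("Hold-out accuracy:", holdout_accuracy(model, TEST))
```

With real data, the hold-out set would be far larger, and accuracy alone can mislead when categories are imbalanced, so additional validation measures would usually be reported.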
