An ensemble scheme based on language function analysis and feature engineering for text genre classification

Text genre classification is the process of identifying functional characteristics of text documents. The immense quantity of text documents available on the web can be filtered, organised and retrieved more effectively with text genre classification, which may also benefit other tasks in natural language processing and information retrieval. Genre may refer to several aspects of text documents, such as function and purpose. Language function analysis (LFA) concentrates on a single aspect of genre and aims to classify text documents into three abstract classes: expressive, appellative and informative. Text genre classification is typically performed by supervised machine learning algorithms. Extracting an efficient feature set to represent text documents is essential for building a robust classification scheme with high predictive performance. In addition, ensemble learning, which combines the outputs of individual classifiers to obtain a more robust classification scheme, is a promising field of machine learning research. In this regard, this article presents an extensive comparative analysis of different feature engineering schemes (features used in authorship attribution, linguistic features, character n-grams, part of speech n-grams and the frequency of the most discriminative words) and five base learners (Naïve Bayes, support vector machines, logistic regression, k-nearest neighbour and Random Forest) in conjunction with ensemble learning methods (Boosting, Bagging and Random Subspace). Based on the empirical analysis, an ensemble classification scheme is presented that integrates a Random Subspace ensemble of Random Forest with four types of features (features used in authorship attribution, character n-grams, part of speech n-grams and the frequency of the most discriminative words). For the LFA corpus, the highest average predictive performance obtained by the proposed scheme is 94.43%.
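The Random Subspace idea mentioned in the abstract can be illustrated with a minimal sketch. It is not the article's implementation: it uses only character n-grams (one of the four feature types), a nearest-centroid base learner as a stand-in for Random Forest, and a handful of made-up sentences in place of the LFA corpus. All function names, parameters and example texts below are assumptions introduced for illustration.

```python
import random
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams in a lowercased text."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def train_centroids(vectors, labels, features):
    """Stand-in base learner: per-class mean n-gram count over a feature subset."""
    class_sizes = Counter(labels)
    sums = {}
    for vec, label in zip(vectors, labels):
        acc = sums.setdefault(label, Counter())
        for f in features:
            acc[f] += vec[f]  # a Counter returns 0 for unseen n-grams
    return {label: {f: acc[f] / class_sizes[label] for f in features}
            for label, acc in sums.items()}


def predict_centroid(centroids, vec):
    """Return the class whose centroid is closest (squared Euclidean distance)."""
    def dist(centroid):
        return sum((vec[f] - value) ** 2 for f, value in centroid.items())
    # Sorting the labels makes tie-breaking deterministic.
    return min(sorted(centroids), key=lambda label: dist(centroids[label]))


def train_random_subspace(texts, labels, n_members=15, subspace_ratio=0.5, seed=0):
    """Train each ensemble member on a different random subset of the features."""
    rng = random.Random(seed)
    vectors = [char_ngrams(t) for t in texts]
    vocabulary = sorted(set().union(*vectors))
    k = max(1, int(len(vocabulary) * subspace_ratio))
    return [train_centroids(vectors, labels, rng.sample(vocabulary, k))
            for _ in range(n_members)]


def predict(members, text):
    """Combine the members' predictions by majority vote."""
    vec = char_ngrams(text)
    votes = Counter(predict_centroid(m, vec) for m in members)
    return votes.most_common(1)[0][0]


# Made-up toy examples for the three LFA classes (not taken from the LFA corpus).
texts = [
    "I absolutely love this phone, it feels wonderful",
    "Honestly I think this camera is terrible",
    "Buy now and save twenty percent on your order",
    "Order today, do not miss this amazing offer",
    "The device has a 6 inch display and 128 GB of storage",
    "The camera weighs 400 grams and records 4K video",
]
labels = ["expressive", "expressive", "appellative",
          "appellative", "informative", "informative"]

members = train_random_subspace(texts, labels)
prediction = predict(members, "Buy this camera today and save money")
print(prediction)
```

Each member sees only half of the n-gram vocabulary, so the members disagree on some inputs and the majority vote averages over those differences. In a setting closer to the article's, the base learner would be a Random Forest and the feature space would combine all four feature types.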
