Classification Using Various Machine Learning Methods and Combinations of Key-Phrases and Visual Features

In this paper, we present a comparative study of news document classification using various supervised machine learning methods and different combinations of key-phrases (word N-grams extracted from the text) and visual features (extracted from a representative image of each document). The application domain is news documents written in English that belong to four categories: Health, Lifestyle-Leisure, Nature-Environment and Politics. Using the N-gram textual feature set alone yielded an accuracy of 81.0%, much better than the 58.4% obtained with the visual feature set alone. A competition among three classification methods, together with a feature selection method and parameter tuning, improved accuracy to 86.7%, achieved by the Random Forests method.
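To make the two feature sets concrete, the following is a minimal sketch of how word N-gram counts might be extracted from a document and joined with a visual feature vector. It is an illustration under stated assumptions, not the procedure used in the study: whitespace tokenization, the fusion by simple concatenation, the function names `word_ngrams` and `combine_features`, the key-phrase vocabulary, and the visual values are all hypothetical.

```python
from collections import Counter


def word_ngrams(text, n_max=3):
    """Return a Counter of word N-grams (N = 1..n_max) found in text.

    Assumption: tokens are lowercased and split on whitespace only.
    """
    tokens = text.lower().split()
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


def combine_features(ngram_counts, vocabulary, visual_vector):
    """Concatenate N-gram counts (in vocabulary order) with visual features.

    Assumption: the two feature sets are fused by plain concatenation.
    """
    textual = [ngram_counts.get(term, 0) for term in vocabulary]
    return textual + list(visual_vector)


doc = "climate change threatens coastal wetlands as climate change accelerates"
counts = word_ngrams(doc, n_max=2)

# Hypothetical key-phrase vocabulary and placeholder image descriptors.
vocabulary = ["climate change", "wetlands", "election"]
visual = [0.12, 0.80]

features = combine_features(counts, vocabulary, visual)
print(features)
```

A vector built this way could then be passed to any supervised classifier, such as a Random Forests implementation; using only the textual part or only the visual part corresponds to the single-feature-set settings compared in the abstract.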
