Paper information - Classification Using Various Machine Learning Methods and Combinations of Key-Phrases and Visual Features

In this paper, we present a comparative study of news document classification using various supervised machine learning methods and different combinations of key-phrases (word N-grams extracted from the text) and visual features (extracted from a representative image of each document). The application domain is news documents written in English that belong to four categories: Health, Lifestyle-Leisure, Nature-Environment, and Politics. Using the N-gram textual feature set alone yielded an accuracy of 81.0%, much higher than the 58.4% obtained using the visual feature set alone. A competition between three classification methods, a feature selection method, and parameter tuning led to an improved accuracy of 86.7%, achieved by the Random Forests method.
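The abstract describes combining word N-gram features with visual features before classification. The following is a minimal sketch of that idea in plain Python, not the authors' implementation: the function names (`word_ngrams`, `build_vocabulary`, `text_vector`, `fuse`), the whitespace tokenization, the frequency-based vocabulary cut-off, and the simple concatenation of the two feature vectors are all assumptions made for illustration. The visual feature values in the usage example are placeholders, since the abstract does not say which image descriptors were used.

```python
from collections import Counter
from typing import List, Sequence, Tuple

NGram = Tuple[str, ...]


def word_ngrams(text: str, n: int) -> List[NGram]:
    """Return all contiguous word n-grams of a lowercased, whitespace-split text."""
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_vocabulary(texts: Sequence[str], max_n: int, top_k: int) -> List[NGram]:
    """Keep the top_k most frequent n-grams (n = 1..max_n) across the corpus.

    Assumption: raw frequency is used as the ranking criterion here purely
    for illustration; the abstract does not state the selection rule.
    """
    counts: Counter = Counter()
    for text in texts:
        for n in range(1, max_n + 1):
            counts.update(word_ngrams(text, n))
    # Sort by descending count, then by the n-gram itself so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [gram for gram, _ in ranked[:top_k]]


def text_vector(text: str, vocabulary: Sequence[NGram], max_n: int) -> List[float]:
    """Count how often each vocabulary n-gram occurs in one document."""
    counts: Counter = Counter()
    for n in range(1, max_n + 1):
        counts.update(word_ngrams(text, n))
    return [float(counts[gram]) for gram in vocabulary]


def fuse(text_features: Sequence[float], visual_features: Sequence[float]) -> List[float]:
    """Concatenate textual and visual features into one vector (hypothetical fusion scheme)."""
    return list(text_features) + list(visual_features)


# Usage with two toy documents and placeholder visual features.
docs = ["climate change policy", "climate change report"]
vocab = build_vocabulary(docs, max_n=2, top_k=3)
combined = fuse(text_vector(docs[0], vocab, max_n=2), [0.25, 0.5])
print(vocab)
print(combined)
```

The combined vector could then be passed to any supervised classifier, such as a Random Forests implementation. Evaluating the textual vector, the visual vector, and the fused vector separately would mirror the kind of comparison reported in the abstract.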
