Hierarchical multi-label news article classification with distributed semantic model based features

Automatic news categorization is essential for handling multi-label news articles in online portals. This research examines three methods for improving the performance of a hierarchical multi-label classifier for Indonesian news articles. The first method uses a Convolutional Neural Network (CNN) to build the top-level classifier. The second represents each document by the average of its word vectors, obtained from a distributed semantic model. The third combines lexical and semantic feature extraction by multiplying word term frequency (lexical) with the word vector average (semantic). The best model, built using Calibrated Label Ranking as the multi-label classification method and trained with the Naive Bayes algorithm, achieved an F1-measure of 0.7531. This classifier used the product of word term frequency and the average of word vectors as its features. This configuration improved multi-label classification performance by 4.25% compared to the baseline. The distributed semantic model that gave the best performance in this experiment was a 300-dimension word2vec model built from Wikipedia articles. Classification performance is also influenced by the news release date: a gap between the periods covered by the training and testing data decreases the model's performance.
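The second and third feature extraction methods can be illustrated with a minimal sketch. It shows one possible reading of the combination, in which each word vector is scaled by that word's term frequency before averaging; the abstract does not specify the exact formula, so this is an assumption rather than the authors' implementation. The toy 3-dimensional embeddings, the function names, and the example tokens are all hypothetical; the experiment itself used 300-dimension word2vec vectors.

```python
from collections import Counter

# Hypothetical toy embeddings standing in for a trained word2vec model.
EMBEDDINGS = {
    "harga": [1.0, 0.0, 0.0],
    "saham": [0.0, 1.0, 0.0],
    "naik": [0.0, 0.0, 1.0],
}


def average_vector(tokens, embeddings, dim):
    """Semantic feature: average the vectors of the distinct known words."""
    words = sorted({t for t in tokens if t in embeddings})
    if not words:
        return [0.0] * dim
    return [sum(embeddings[w][i] for w in words) / len(words) for i in range(dim)]


def tf_weighted_vector(tokens, embeddings, dim):
    """Lexical + semantic feature (assumed form): scale each word vector by
    its term frequency, then divide by the total count of known tokens."""
    counts = Counter(t for t in tokens if t in embeddings)
    if not counts:
        return [0.0] * dim
    total = sum(counts.values())
    return [
        sum(c * embeddings[w][i] for w, c in counts.items()) / total
        for i in range(dim)
    ]


# "hari" is out of vocabulary and is skipped by both functions.
doc = ["harga", "saham", "saham", "naik", "hari"]
print(average_vector(doc, EMBEDDINGS, 3))
print(tf_weighted_vector(doc, EMBEDDINGS, 3))  # [0.25, 0.5, 0.25]
```

In the weighted version the repeated word "saham" contributes twice as much as the others, whereas the plain average treats every distinct word equally. The resulting document vector would then be passed to the multi-label classifier.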
