Word2vec semantic representation in multilabel classification for Indonesian news article

Multilabel text classification is the task of assigning a text to one or more categories. As with other supervised learning tasks, the performance of multilabel classification is limited when little labeled data is available, which makes it difficult to capture semantic relationships. Previous research on multilabel classification of Indonesian news articles focused on lexical features based on bag of words with TF-IDF term weighting, and no work has yet used semantic features. This paper presents an implementation of multilabel classification using semantic features based on Word2vec. Word2vec is an unsupervised method that uses unlabeled data to convert each word into a vector representation, in which the semantic relationship between words can be measured by the distance between their vectors. The experiment shows that the semantic features improve on the previous result obtained with the traditional bag-of-words and TF-IDF method, raising the test F-measure from 76.73% to 80.17%.
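The idea of measuring semantic relatedness by vector distance can be illustrated with a minimal sketch. The three-dimensional vectors below are invented for illustration and are not taken from the paper; real Word2vec vectors are learned from unlabeled text and usually have hundreds of dimensions. Cosine similarity as the distance measure and averaging word vectors into a document feature are assumptions made here, and the names `TOY_VECTORS`, `cosine_similarity`, and `document_vector` are hypothetical.

```python
import math

# Hypothetical toy embeddings (assumption: invented values, not learned ones).
TOY_VECTORS = {
    "presiden": [0.9, 0.1, 0.0],   # "president"
    "menteri": [0.8, 0.2, 0.1],    # "minister"
    "sepakbola": [0.0, 0.9, 0.3],  # "football"
    "gol": [0.1, 0.8, 0.4],        # "goal"
}


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def document_vector(tokens, vectors):
    """Average the vectors of the tokens that have an embedding.

    Assumption: averaging is one common way to turn word vectors into a
    fixed-length document feature; the paper may build features differently.
    """
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        raise ValueError("no token in the document has a vector")
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


if __name__ == "__main__":
    related = cosine_similarity(TOY_VECTORS["presiden"], TOY_VECTORS["menteri"])
    unrelated = cosine_similarity(TOY_VECTORS["presiden"], TOY_VECTORS["sepakbola"])
    print(f"presiden vs menteri:   {related:.3f}")
    print(f"presiden vs sepakbola: {unrelated:.3f}")
    print(document_vector(["presiden", "menteri", "kata_asing"], TOY_VECTORS))
```

With these toy values, the two political terms score a higher similarity than a political term paired with a sports term, and the resulting document vector could be passed to any multilabel classifier in place of a bag-of-words vector.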
