Semantically readable distributed representation learning for social media mining

A problem with the distributed representations generated by neural networks is that the meaning of their features is difficult to understand. We propose a new method that assigns a specific meaning to each node of a hidden layer by introducing a manually created word semantic vector dictionary into the initial weights of paragraph vector models. Our experimental results demonstrate that the weights obtained by learning and the weights taken from the dictionary are more strongly correlated in a closed test, and more weakly correlated in an open test, than in a control test. In addition, the learned vectors outperform the existing paragraph vectors on a sentiment analysis task. Finally, we evaluated the readability of the document embeddings in a user test, where readability means that people can understand the meaning of the heavily weighted features of a distributed representation. A total of 52.4% of the top five weighted hidden nodes were judged to be related to their tweets when one of the paragraph vector models learned the document embeddings. Because each hidden node keeps a specific meaning, the proposed method succeeds in improving readability.
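
The core idea is that the hidden dimensions are seeded from a hand-built word semantic dictionary, so each dimension keeps a human-readable meaning while the paragraph vector model trains. The following is a minimal sketch of that idea, not the authors' implementation: the feature names, the toy semantic_dict, and the simplified PV-DBOW-style sgd_step are illustrative assumptions.

```python
# Sketch (assumption): seed paragraph-vector-style weights from a hand-built
# word semantic dictionary so each hidden dimension keeps a readable meaning.
import numpy as np

# Hypothetical semantic dictionary: each word maps to a vector whose dimensions
# correspond to human-readable features.
feature_names = ["sports", "food", "emotion"]
semantic_dict = {
    "soccer": np.array([1.0, 0.0, 0.0]),
    "pizza":  np.array([0.0, 1.0, 0.0]),
    "happy":  np.array([0.0, 0.0, 1.0]),
}

vocab = ["soccer", "pizza", "happy", "the"]
dim = len(feature_names)
rng = np.random.default_rng(0)

# Word weights: use the dictionary vector when available, otherwise fall back
# to a small random initialization.
W_word = np.vstack([
    semantic_dict.get(w, rng.normal(scale=0.01, size=dim)) for w in vocab
])

# Document (paragraph) vectors start near zero; after training, each dimension
# should stay aligned with the corresponding dictionary feature.
n_docs = 2
D = rng.normal(scale=0.01, size=(n_docs, dim))

def sgd_step(doc_id, word_id, lr=0.05):
    """One PV-DBOW-like positive-sample update for a (document, word) pair."""
    score = 1.0 / (1.0 + np.exp(-D[doc_id] @ W_word[word_id]))  # sigmoid score
    D[doc_id] += lr * (1.0 - score) * W_word[word_id]           # pull toward word semantics

# Toy usage: document 0 mentions "soccer", so its "sports" dimension grows.
for _ in range(100):
    sgd_step(0, vocab.index("soccer"))
print(dict(zip(feature_names, D[0].round(3))))
```

Under this kind of initialization, the dictionary-defined dimensions are intended to remain semantically stable during training, which is why the paper compares the correlation between learned weights and dictionary weights in the closed and open tests.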
