Semantically Readable Distributed Representation Learning and Its Expandability Using a Word Semantic Vector Dictionary

[Table residue (caption not shown): semantic categories and example feature words. Spirit·Psychology: Sense, Emotion, Happiness, Sadness; Abstract concept: State·Aspect, Change, Relationship; Physics·Motion: Motion, Halt, Dynamic, Static; Substance, Physical characteristics: Warmth, Weight, Lightness, Flexible; Civilization·Humanities: Race, Knowledge, Speech; Information Science: Mathematics, Physics, Astronomy.]

Table 2  Grant criteria by logical relationship.
  Logical relationship       Core words   Feature words
  Class inclusion            Autumn       Season
  Synonym relationship       Idea         Thought
  Part–whole relationship    Leg          Human body

Table 3  Grant criteria by associative relationship.
  Core words   Feature words
  Love         Kindness, Warmth
  Up           Economy, Video
  Leg          Car, Traffic·Transportation

Some feature words are related to core words by association, as shown in Table 3.

3.3 Model Setting for Testing the Hypothesis

In this section, we describe the setting used to encode the initial weights of core words according to the strength of their relationship with each feature word, and we describe our test of the hypothesis using Skip-gram as an example.

A method has been proposed for generating a word vector by recursively expanding the definition sentence of a word in a dictionary [19]. The word semantic vector dictionary can be regarded as defining each core word with 266 types of feature words. Because feature words are themselves core words, recursive expansion is necessary. However, the expansion converges after the feature words are expanded several times because the definition of a core word is limited to the 266 feature words. A method has also been proposed for retrofitting word vectors according to related words in a dictionary [14]. Using that method, we generate the seed vector of each core word by recursively expanding the dictionary with the published retrofitting tool†. When the vocabulary is built from the corpus, the following two kinds of initial vectors are created first.

• The 266 feature words are added to the vocabulary as 266-dimensional one-hot vectors, with the dimension corresponding to each feature word set to 1.
• The initial vectors of all other words extracted from the corpus, including core words, are 266-dimensional zero vectors.

†https://github.com/mfaruqui/retrofitting

Fig. 2  Example of retrofitting "Disease."

The retrofitting algorithm is a post-processing step that pulls learned word vectors closer to related entries in a lexicon [14]. We applied this algorithm to retrofit the aforementioned initial word vectors, which are 266-dimensional one-hot or zero vectors, to the word semantic vector dictionary. The retrofitting algorithm is the following online update [14]:

$$q_i = \frac{\sum_{j:(i,j)\in E} \beta_{ij}\, q_j + \alpha_i\, \hat{q}_i}{\sum_{j:(i,j)\in E} \beta_{ij} + \alpha_i} \qquad (1)$$

Here, $q_i$ is the retrofitted word vector of the core word $w_i$, $\hat{q}_i$ is the aforementioned initial vector of $w_i$, and $\alpha_i$ is the weight of the initial vector; currently it is set to the number of feature words $w_j$ given to $w_i$. $q_j$ is the retrofitted word vector of the given feature word $w_j$, and $\beta_{ij}$ is the weight of the given feature word $w_j$ for the core word $w_i$; currently, $\beta_{ij}$ is set to 1. Equation (1) multiplies the initial vector $\hat{q}_i$ of the core word $w_i$ by the weight $\alpha_i$, adds the retrofitted vectors $q_j$ of the given feature words $w_j$ multiplied by the weights $\beta_{ij}$, and divides the result by the sum of both weights.
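To make the recursive expansion concrete, the following is a minimal sketch of the online update in Eq. (1) on a toy lexicon. The names `initial_vectors`, `lexicon`, and `NUM_FEATURES` are illustrative; the actual experiments used the retrofitting tool cited above.

```python
import numpy as np

NUM_FEATURES = 266  # one dimension per feature word


def retrofit(initial_vectors, lexicon, iterations=10):
    """Online update of Eq. (1).

    initial_vectors: dict mapping each core word to its 266-dim one-hot or
                     zero initial vector (q_hat_i).
    lexicon: dict mapping each core word to the feature words assigned to it
             in the word semantic vector dictionary.
    """
    q = {w: v.astype(float).copy() for w, v in initial_vectors.items()}
    for _ in range(iterations):
        for w_i, feature_words in lexicon.items():
            neighbors = [w_j for w_j in feature_words if w_j in q]
            if not neighbors:
                continue
            alpha_i = float(len(neighbors))   # weight of the initial vector
            beta_ij = 1.0                     # weight of each given feature word
            numerator = alpha_i * initial_vectors[w_i]
            for w_j in neighbors:
                numerator = numerator + beta_ij * q[w_j]
            q[w_i] = numerator / (beta_ij * len(neighbors) + alpha_i)
    # normalize each retrofitted vector to length 1
    for w in q:
        norm = np.linalg.norm(q[w])
        if norm > 0:
            q[w] = q[w] / norm
    return q
```

With one-hot initial vectors for the 266 feature words and zero vectors for the other core words, repeating the update for about ten iterations propagates weights from the feature words to the core words that reference them, which corresponds to the recursive expansion described above.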
Running the online update for about ten iterations, with the retrofitted word vector $q_i$ used as the next initial vector $\hat{q}_i$, increases the number of feature words related to each core word from an average of 9 to an average of 100 out of the 266 feature words. The relationship grows for each core word because the feature words given to it are expanded recursively. The retrofitted word vector is normalized to length 1.

Figure 2 presents an example of retrofitting "Disease," which is both a feature word and a core word in the dictionary. The main points of this algorithm are the following.

• The retrofitted word vector stays close to the original vector. In the case of "Disease," the original vector is a one-hot vector.
• When the feature words assigned to a retrofitted core word are not expanded as core words, the weights of those feature words are almost equal. When they are expanded, the weights decrease according to the number of feature words to be expanded.

In the vocabulary, each word has two vectors. One is the input vector, i.e., the weights between the input node and each hidden node; the other is the output vector, i.e., the weights between each hidden node and the output node. The retrofitted word vector was used as the seed of the input vector. The initial input weights of words other than core words were set to 0. The initial weights of the output vectors of all words, including core words, were also set to 0, which is the default setting of gensim's doc2vec library†.

†https://radimrehurek.com/gensim/models/doc2vec.html

Fig. 3  Skip-gram model setting for testing.

Figure 3 presents an example of the Skip-gram setting for testing the hypothesis. The input layer specifies the target word. The output layer consists of three context words appearing around the target word. The hidden layer comprises the nodes corresponding to the 266 feature words. The weights of the target word for each hidden node are the retrofitted weights. Each weight is updated by back-propagation so that the probability of predicting the context words increases when the target word is input. The objective function is the following [3], [20]:

$$E = -\log \sigma\!\left({v'_{w}}^{T} h\right) - \sum_{w_j \in W_{neg}} \log \sigma\!\left(-{v'_{w_j}}^{T} h\right) \qquad (2)$$

Because the activation function of the hidden nodes is linear, the output of the hidden layer is $h = {v_{w_I}}^{T}$, i.e., the input vector of the target word $w_I$. $v_w$ is an input vector, whose initial weights are generated by Eq. (1), and $v'_w$ is the output vector of the word $w$. $W_{neg}$ is the set of words drawn for negative sampling. The output vector is updated as follows [20]:

$${v'_{w_j}}^{(\mathrm{new})} = {v'_{w_j}}^{(\mathrm{old})} - \eta\left(\sigma\!\left({v'_{w_j}}^{(\mathrm{old})\,T} h\right) - t_j\right) h \qquad (3)$$

where $t_j$ is 1 when $w_j$ is the context word and 0 otherwise. The initial output vector $v'_w$ is 0. Thus, the output vectors of the context words become close to the input vector of the target word, which is the seed vector.

4. Verification of the Hypothesis with a Single Domain Benchmark

In these experiments, we examined the relationship between sentiment analysis using a single domain benchmark and the readability of tweet embeddings in a user test.

Table 4  Hyper-parameter settings for learning word vectors.
  Hyper-parameter                           Value
  Dimensionality of the feature vectors     266
  Number of iterations over the corpus      20
  Learning rate                             initial: 0.025, minimum: 0.0001
  Window size                               5
  Downsample threshold for words            1e-5
  Number of negative sampling words         15
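The following is a minimal sketch, assuming the gensim 4.x API (parameter names differ in older releases), of how the settings in Table 4 could be passed to gensim's doc2vec library and how the retrofitted vectors could be injected as seed input vectors for core words, as described in Sect. 3.3. The names `tagged_tweets` and `retrofitted` are illustrative, and the authors' experiments used their own modification of the library, so this is only one possible realization rather than the authors' exact implementation.

```python
from gensim.models.doc2vec import Doc2Vec


def build_seeded_model(tagged_tweets, retrofitted, dm=1):
    """Train a paragraph vector model whose core-word input vectors are seeded
    with retrofitted vectors (a sketch, not the authors' modified setup).

    tagged_tweets: list of gensim TaggedDocument objects (one per tweet).
    retrofitted: dict mapping core words to 266-dim numpy seed vectors.
    """
    model = Doc2Vec(
        dm=dm,                # 1 = PV-DM, 0 = PV-DBOW (the two variants in Sect. 4.1)
        dm_mean=0,            # use the sum of input vectors in the hidden layer
        vector_size=266,      # dimensionality = number of feature words
        window=5,             # window size
        alpha=0.025,          # initial learning rate
        min_alpha=0.0001,     # minimum learning rate
        sample=1e-5,          # downsample threshold for frequent words
        negative=15,          # number of negative sampling words
        min_count=5,          # keep words appearing five or more times
        epochs=20,            # iterations over the corpus
    )
    model.build_vocab(tagged_tweets)
    # Seed the input vectors: zeros for ordinary words, retrofitted weights for
    # core words (gensim initializes input vectors randomly by default; the
    # output vectors, syn1neg, already default to zero).
    model.wv.vectors[:] = 0.0
    for word, seed in retrofitted.items():
        if word in model.wv.key_to_index:
            model.wv.vectors[model.wv.key_to_index[word]] = seed
    model.train(tagged_tweets, total_examples=model.corpus_count, epochs=model.epochs)
    return model
```

Passing `dm=0` gives the PV-DBOW variant; with `dm=1`, `dm_mean=0` sums the input vectors in the hidden layer, matching the setting described in Sect. 4.1.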
We also tested the hypothesis of whether the weights obtained by learning and the weights based on the dictionary are correlated, in a closed test and an open test, compared with a control test. We used the single-domain sentiment analysis benchmark for Product B and the 560,853 unlabeled tweets described in Appendix A. From the unlabeled tweets, only noise such as URLs and account names was removed. The sentiment analysis evaluation benchmark consisted of 11,774 tweets about one product brand, labeled by crowdsourcing as positive, negative, or neutral [7]. The Japanese morphological analyzer MeCab† and the dictionary mecab-ipadic-NEologd††, which extends MeCab's default dictionary with millions of new words and named entities drawn from language resources on the Web, were used to extract words from the tweets. Inflected forms of verbs and adjectives were treated as distinct words, without converting them to their base forms, so that the word embeddings could learn their contexts. The number of words extracted from the corpus five or more times was 30,468, and the number of retrofitted core words was 6,814.

†http://mecab.googlecode.com/svn/trunk/mecab/doc/index.html
††https://github.com/neologd/mecab-ipadic-neologd

4.1 Learning Word Vectors by Our Method and Evaluation of Correlation Coefficients

We updated word vectors with the two variants of the paragraph vector model, trained on the unlabeled tweets only, using gensim's doc2vec library. The hyper-parameter values for paragraph vector learning of the conventional method were chosen on the basis of the sentiment analysis accuracy in the final stage. The hyper-parameter settings for learning from the corpus are shown in Table 4; our method used the same settings. Here, the size of the feature vectors was set to the number of feature words, 266. If the dimensionality of the feature vectors exceeded 266, our method could initialize the extra dimensions with 0 or with random values. However, no difference in accuracy occurred between 266 and 300 dimensions for this corpus with the conventional paragraph vector, so we used 266 dimensions. PV-DM and PV-DBOW used the same hyper-parameter settings. We used the sum of the input vectors for the hidden layer of PV-DM for the same reason as for the hyper-parameter settings.

Table 5  Example of retrofitted and learned word vectors for a core word that is a feature word itself.
  Generation method                        Feature words and weights in descending order
  Retrofitted vector for "travel"          travel:0.97, traffic·transportation:0.12, hobby·recreation:0.1, home·family:0.1, service industry:0.1, airplane:0.06, human:0.05, car:0.05, overseas:0.05, Japan:0.05
  Learned vector for "travel" by PV-DM     travel:1.41, machine:0.65, image:0.61, company:0.55, state·aspect:0.52, traffic·transportation:0.5, hobby·recreation:0.43, education:0.40, facility:0.38, behavior:0.36
  Learned vector for "travel" by PV-DBOW   travel:1.45, time:0.45, custom:0.44, clothes:0.43, state·aspect:0.43, Europe:0.42, low:0.42, image:0.41, public system:0.40, machine:0.40

Table 6  Evaluation results 1: Correlation coefficients between initial and learned word vectors.
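To illustrate how the readability shown in Table 5 and the correlation coefficients referenced in Table 6 can be computed, the following is a minimal sketch. It assumes a list of the 266 feature words aligned with the vector dimensions and a per-word Pearson correlation; the paper does not specify the exact computation, and the names `feature_names`, `retrofitted_vectors`, and `learned_vectors` are illustrative.

```python
import numpy as np


def readable(vector, feature_names, top_n=10):
    """Return the top-n (feature word, weight) pairs of a 266-dim word vector,
    sorted by weight in descending order, as in Table 5."""
    order = np.argsort(vector)[::-1][:top_n]
    return [(feature_names[i], round(float(vector[i]), 2)) for i in order]


def correlation(initial_vector, learned_vector):
    """Pearson correlation coefficient between the initial (retrofitted) vector
    and the learned vector of the same word, as referenced in Table 6."""
    return float(np.corrcoef(initial_vector, learned_vector)[0, 1])


# Illustrative usage (feature_names: list of the 266 feature words aligned with
# the vector dimensions; retrofitted_vectors / learned_vectors: dicts mapping
# words to 266-dim numpy arrays):
# print(readable(learned_vectors["travel"], feature_names))
# print(correlation(retrofitted_vectors["travel"], learned_vectors["travel"]))
```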

References

[1] Jeffrey Dean, et al., Distributed Representations of Words and Phrases and their Compositionality, NIPS, 2013.
[2] Jeffrey Pennington, et al., GloVe: Global Vectors for Word Representation, EMNLP, 2014.
[3] George A. Miller, et al., WordNet: A Lexical Database for English, HLT, 1995.
[4] Xin Rong, et al., word2vec Parameter Learning Explained, ArXiv, 2014.
[5] Satoshi Suzuki, Probabilistic Word Vector and Similarity Based on Dictionaries, CICLing, 2003.
[6] Jasper Snoek, et al., Practical Bayesian Optimization of Machine Learning Algorithms, NIPS, 2012.
[7] Ken-ichi Kawarabayashi, et al., Joint Word Representation Learning Using a Corpus and a Semantic Lexicon, AAAI, 2015.
[8] Xiaomo Liu, et al., Tweet Topic Classification Using Distributed Language Representations, 2016 IEEE/WIC/ACM International Conference on Web Intelligence (WI), 2016.
[9] Ikuo Keshi, et al., Associative image retrieval using knowledge in encyclopedia text, Systems and Computers in Japan, 1996.
[10] Geoffrey Zweig, et al., Linguistic Regularities in Continuous Space Word Representations, NAACL, 2013.
[11] John B. Lowe, et al., The Berkeley FrameNet Project, ACL, 1998.
[12] Gang Wang, et al., RC-NET: A General Framework for Incorporating Knowledge into Word Representations, CIKM, 2014.
[13] Satoshi Nakamura, et al., Semantically readable distributed representation learning for social media mining, WI, 2017.
[14] Timothy Baldwin, et al., An Empirical Evaluation of doc2vec with Practical Insights into Document Embedding Generation, Rep4NLP@ACL, 2016.
[15] Mamoru Komachi, et al., Construction of a Japanese Word Similarity Dataset, LREC, 2017.
[16] Michael I. Jordan, et al., Latent Dirichlet Allocation, J. Mach. Learn. Res., 2001.
[17] Jeffrey Dean, et al., Efficient Estimation of Word Representations in Vector Space, ICLR, 2013.
[18] Quoc V. Le, et al., Distributed Representations of Sentences and Documents, ICML, 2014.
[19] Preslav Nakov, et al., SemEval-2015 Task 10: Sentiment Analysis in Twitter, *SEMEVAL, 2015.
[20] Soroush Vosoughi, et al., Tweet2Vec: Learning Tweet Embeddings Using Character-level CNN-LSTM Encoder-Decoder, SIGIR, 2016.
[21] Zhiyuan Liu, et al., Topical Word Embeddings, AAAI, 2015.