Enhanced Double-Carrier Word Embedding via Phonetics and Writing

Word embeddings map words into a unified vector space and capture rich semantic information. From a linguistic point of view, words have two carriers: speech and writing. Yet most recent word embedding models focus only on the written carrier and ignore the role of speech in semantic expression. In the development of language, however, speech appears before writing and strongly shapes how writing develops; for phonetic writing systems, written forms are secondary symbols of spoken ones. Based on this idea, we propose double-carrier word embedding (DCWE), which simulates the order in which speech and writing emerge: we train the written embedding on top of the phonetic embedding, and the final word embedding fuses the written and phonetic representations. To show that our model applies to most languages, we select Chinese, English, and Spanish as examples and evaluate DCWE through word similarity and text classification experiments.
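The abstract only sketches the final fusion step. As a minimal illustration, the snippet below assumes the fusion is a weighted concatenation of the two unit-normalized views; the function name `fuse_embeddings` and the mixing weight `alpha` are hypothetical, and the paper's actual fusion operator may differ.

```python
import numpy as np

def fuse_embeddings(phonetic_vec, written_vec, alpha=0.5):
    """Combine a phonetic and a written embedding into one word vector.

    Hedged sketch: assumes weighted concatenation of unit-normalized
    views, one common way to merge two modality-specific embeddings.
    `alpha` (hypothetical) balances the two carriers.
    """
    p = phonetic_vec / np.linalg.norm(phonetic_vec)  # unit-normalize speech view
    w = written_vec / np.linalg.norm(written_vec)    # unit-normalize writing view
    return np.concatenate([alpha * p, (1.0 - alpha) * w])

# Toy example: a 3-d phonetic view and a 3-d written view give a 6-d word vector.
phon = np.array([1.0, 0.0, 0.0])
writ = np.array([0.0, 1.0, 0.0])
vec = fuse_embeddings(phon, writ)
assert vec.shape == (6,)
```

In this sketch the fused vector keeps both carriers' information side by side, so downstream tasks (word similarity, text classification) can draw on either view.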
