A Distributional Semantics Based Syntagmatic Association Measuring Method

Two kinds of relations exist between words: syntagmatic and paradigmatic. Word embedding, a state-of-the-art model of distributional semantics, has been used to discover paradigmatic relations between words and is widely applied in natural language processing tasks. Based on the hypothesis that, at the sentence level and leaving aside words in paradigmatic relations, two words in a syntagmatic relation are more similar than two words not in any syntagmatic relation, we propose to discover words in syntagmatic relations within a sentence using word embedding based similarity computation. The experiments show that word embedding based similarity between words in syntagmatic relations is higher than that between words not in any syntagmatic relation. The word embedding based method is also competitive with the best measures in the literature and can be a good complement to them. This finding can benefit many syntagmatic-related natural language processing tasks such as parsing, text generation, machine translation, collocation extraction, and multi-word expression recognition. A further experiment in collocation extraction shows that the proposed word embedding based association measure is effective in filtering noisy collocation candidates at the sentence level and that it outperforms the existing well-known association measures in precision, recall, and F-measure.
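The idea of scoring word pairs in a sentence by embedding similarity can be illustrated with a minimal sketch. It assumes cosine similarity as the similarity function and uses a tiny hand-written embedding table (`TOY_EMBEDDINGS`) in place of trained vectors. The function names, the toy vectors, and the threshold value are all hypothetical and are not taken from the paper. The sketch also makes no attempt to separate out paradigmatically related pairs, which the hypothesis sets aside.

```python
import math
from itertools import combinations

# Hypothetical toy vectors standing in for trained word embeddings.
TOY_EMBEDDINGS = {
    "drink": [0.9, 0.1, 0.2],
    "strong": [0.7, 0.3, 0.1],
    "tea": [0.8, 0.2, 0.1],
    "yesterday": [0.1, 0.9, 0.3],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_pairs(sentence, embeddings):
    """Score every pair of in-vocabulary words in a sentence, highest first."""
    words = [w for w in sentence if w in embeddings]
    scored = [
        ((a, b), cosine(embeddings[a], embeddings[b]))
        for a, b in combinations(words, 2)
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


def filter_candidates(candidates, embeddings, threshold):
    """Keep candidate pairs whose similarity reaches the (assumed) threshold."""
    return [
        (a, b)
        for a, b in candidates
        if a in embeddings
        and b in embeddings
        and cosine(embeddings[a], embeddings[b]) >= threshold
    ]


if __name__ == "__main__":
    sentence = ["drink", "strong", "tea", "yesterday"]
    for pair, score in rank_pairs(sentence, TOY_EMBEDDINGS):
        print(pair, round(score, 3))
```

In practice the toy table would be replaced by vectors trained on a large corpus, and the threshold would need to be tuned on held-out data rather than fixed by hand.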
