Paper information - A Word Embedding Approach to Identifying Verb-Noun Idiomatic Combinations

Verb–noun idiomatic combinations (VNICs) are idioms consisting of a verb with a noun in its direct object position. Usages of these expressions can be ambiguous between an idiomatic usage and a literal combination. In this paper we propose supervised and unsupervised approaches, based on word embeddings, to identifying token instances of VNICs. Our proposed supervised and unsupervised approaches perform better than the supervised and unsupervised approaches of Fazly et al. (2009), respectively.

1 Verb–noun Idiomatic Combinations

Much research on multiword expressions (MWEs) in natural language processing (NLP) has focused on various type-level prediction tasks, e.g., MWE extraction (e.g., Church and Hanks, 1990; Smadja, 1993; Lin, 1999), which determines which MWE types are present in a given corpus (Baldwin and Kim, 2010), and compositionality prediction (e.g., McCarthy et al., 2003; Reddy et al., 2011; Salehi et al., 2014). However, word combinations can be ambiguous between literal combinations and MWEs. For example, consider the following two usages of the expression hit the roof:

1. I think Paula might hit the roof if you start ironing.
2. When the blood hit the roof of the car I realised it was serious.

The first example of hit the roof is an idiomatic usage, while the second is a literal combination. (These examples, and idiomaticity judgements, are taken from Cook et al. (2008).)

MWE identification is the task of determining which token instances in running text are MWEs (Baldwin and Kim, 2010). Although there has been relatively less work on MWE identification than on type-level MWE prediction tasks, it is nevertheless important for NLP applications such as machine translation that must be able to distinguish MWEs from literal combinations in context. Some recent work has focused on token-level identification of a wide range of types of MWEs and other multiword units (e.g., Newman et al., 2012; Schneider et al., 2014; Brooke et al., 2014).
Many studies, however, have taken a word sense disambiguation–inspired approach to MWE identification (e.g., Birke and Sarkar, 2006; Katz and Giesbrecht, 2006; Li et al., 2010), treating literal combinations and MWEs as different word senses, and have exploited linguistic knowledge of MWEs (e.g., Patrick and Fletcher, 2005; Uchiyama et al., 2005; Hashimoto and Kawahara, 2008; Fazly et al., 2009; Fothergill and Baldwin, 2012).

In this study we focus on English verb–noun idiomatic combinations (VNICs). VNICs are formed from a verb with a noun in its direct object position. They are a common and productive type of English idiom, and occur cross-lingually (Fazly et al., 2009).

VNICs tend to be relatively lexico-syntactically fixed, e.g., whereas hit the roof is ambiguous between literal and idiomatic meanings, hit the roofs and a roof was hit are most likely to be literal usages. Fazly et al. (2009) exploit this property in their unsupervised approach, referred to as CFORM. They define lexico-syntactic patterns for VNIC token instances based on the noun’s determiner (e.g., a, the, or possibly no determiner), the number of the noun (singular or plural), and the verb’s voice (active or passive). They propose a statistical method for automatically determining a given VNIC type’s canonical idiomatic form, based on the frequency of its usage in these