Idiomatic expressions formed from a verb and a noun in its direct object position are a productive cross-lingual class of multiword expressions, which can be used both idiomatically and as a literal combination. This paper presents the VNC-Tokens dataset, a resource of almost 3000 English verb–noun combination usages annotated as to whether they are literal or idiomatic. Previous research using this dataset is described, and other studies which could be evaluated more extensively using this resource are identified.

1. Verb–Noun Combinations

Identifying multiword expressions (MWEs) in text is essential for accurately performing natural language processing tasks (Sag et al., 2002). A broad class of MWEs with distinct semantic and syntactic properties is that of idiomatic expressions. A productive process of idiom creation across languages is to combine a high frequency verb and one or more of its arguments. In particular, many such idioms are formed from the combination of a verb and a noun in the direct object position (Cowie et al., 1983; Nunberg et al., 1994; Fellbaum, 2002), e.g., give the sack, make a face, and see stars. Given the richness and productivity of the class of idiomatic verb–noun combinations (VNCs), we choose to focus on these expressions.

It is a commonly held belief that expressions with an idiomatic interpretation are primarily used idiomatically, and that they lose their literal meanings over time. Nonetheless, it is still possible for a potentially-idiomatic combination to be used in a literal sense, as in: She made a face on the snowman using a carrot and two buttons. Contrast the above literal usage with the idiomatic use in: The little girl made a funny face at her mother. Interestingly, in our analysis of 60 VNCs, we found that approximately half of these expressions are attested fairly frequently in their literal sense in the British National Corpus (BNC).¹

Clearly, automatic methods are required for distinguishing between idiomatic and literal usages of such expressions, and indeed there have recently been several studies addressing this issue (Birke and Sarkar, 2006; Katz and Giesbrecht, 2006; Cook et al., 2007). In order to conduct further research on VNCs at the token level, and to compare the effectiveness of the varying proposed methods for their treatment, an annotated corpus of VNC usages is required.

Section 2 describes our dataset, VNC-Tokens, which consists of almost 3000 English sentences, each containing a VNC usage (token) annotated as to whether it is literal or idiomatic. Sections 3, 4, and 5 respectively describe previous research conducted using VNC-Tokens, other work on idioms which could make use of this dataset, and possible ways in which VNC-Tokens could be extended. We summarize the contributions of the VNC-Tokens resource in Section 6.

¹ http://www.natcorp.ox.ac.uk

2. The VNC-Tokens Dataset

The following subsections describe the selection of the expressions in VNC-Tokens, how usages of these expressions were found, and the annotation of the tokens.
[1] I. R. McCaig et al. Oxford Dictionary of Current Idiomatic English. 1994.
[2] Susanne Z. Riehemann et al. A constructional approach to idioms and word formation. 2001.
[3] Timothy Baldwin et al. Multiword Expressions: A Pain in the Neck for NLP. CICLing, 2002.
[4] Michael Collins et al. Head-Driven Statistical Models for Natural Language Parsing. CL, 2003.
[5] Satoshi Sato et al. Detecting Japanese idioms with a linguistically rich dictionary. Lang. Resour. Evaluation, 2006.
[6] Anoop Sarkar et al. A Clustering Approach for Nearly Unsupervised Recognition of Nonliteral Language. EACL, 2006.
[7] Eugenie Giesbrecht et al. Automatic Identification of Non-Compositional Multi-Word Expressions using Latent Semantic Analysis. 2006.
[8] Afsaneh Fazly et al. Automatically Constructing a Lexicon of Verb Phrase Idiomatic Combinations. EACL, 2006.
[9] Afsaneh Fazly et al. Pulling their Weight: Exploiting Syntactic Forms for the Automatic Identification of Idiomatic Expressions in Context. 2007.
[10] Afsaneh Fazly et al. Unsupervised Type and Token Identification of Idiomatic Expressions. CL, 2009.