Compilation of an idiom example database for supervised idiom identification

Some phrases can be interpreted either idiomatically (figuratively) or literally, depending on context. Identifying idioms precisely is essential for full-fledged natural language processing. For this reason, the authors have created an idiom corpus for Japanese. This paper reports on the corpus itself and on the results of an idiom identification experiment conducted using it. The corpus covers 146 ambiguous idioms and consists of 102,856 examples, each annotated with a literal/idiomatic label. All sentences were collected from the World Wide Web. For idiom identification, 90 of the 146 idioms were targeted, and a word sense disambiguation (WSD) method was adopted that uses both common WSD features and idiom-specific features. The corpus and the experiment are both, as far as can be determined, the largest of their kinds. A standard supervised WSD method was found to work well for idiom identification, achieving accuracies of 89.25% with idiom-specific features and 88.86% without them. The most effective idiom-specific feature was found to be the one involving the adjacency of idiom constituents.
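To make the adjacency feature concrete, the following is a minimal sketch of how such a feature might be computed over a tokenized sentence. It is not the authors' implementation: the function names, the greedy left-to-right matching, and the English example tokens are all assumptions made for illustration (the corpus itself is Japanese, and the abstract does not specify how adjacency is defined).

```python
from typing import List, Optional


def constituent_positions(tokens: List[str],
                          constituents: List[str]) -> Optional[List[int]]:
    """Hypothetical helper: find each idiom constituent in order.

    Returns the token index of the first in-order occurrence of every
    constituent, or None if any constituent is missing. The greedy
    first-match strategy is a simplifying assumption.
    """
    positions = []
    start = 0
    for constituent in constituents:
        try:
            index = tokens.index(constituent, start)
        except ValueError:
            return None
        positions.append(index)
        start = index + 1
    return positions


def adjacency_feature(tokens: List[str], constituents: List[str]) -> bool:
    """Hypothetical binary feature: True when the constituents occur
    in order with no intervening tokens between them."""
    positions = constituent_positions(tokens, constituents)
    if positions is None:
        return False
    return all(b - a == 1 for a, b in zip(positions, positions[1:]))


# Illustrative usage with made-up English tokens.
idiom = ["kicked", "the", "bucket"]
print(adjacency_feature(["he", "kicked", "the", "bucket"], idiom))
print(adjacency_feature(["he", "kicked", "the", "old", "bucket"], idiom))
```

In a supervised setup, a value like this would be one entry in the feature vector passed to the classifier, alongside the common WSD features; the intuition is that an idiomatic reading is more likely when nothing intervenes between the constituents.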
