Automatic Detection of Idiomatic Clauses

We describe several experiments whose goal is to automatically identify idiomatic expressions in written text. We explore two approaches to the task: 1) idiom recognition as outlier detection; and 2) supervised classification of sentences. For outlier detection, we apply principal component analysis. Because detecting idioms as lexical outliers does not exploit class label information, in the subsequent experiments we use linear discriminant analysis to obtain a discriminant subspace and then apply a three-nearest-neighbor classifier in that subspace to measure classification accuracy. We discuss the pros and cons of each approach. Both approaches are more general than previous algorithms for idiom detection: they rely neither on target idiom types, lexicons, or large manually annotated corpora, nor on restricting the search space to a particular type of linguistic construction.
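The two approaches can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the synthetic vectors stand in for sentence feature vectors, and the function names, dimensions, and the choice of reconstruction error as the outlier score are all assumptions made for illustration. The first part fits principal components on "literal" data and scores new rows by their distance from that subspace; the second part computes a two-class Fisher discriminant direction and classifies held-out points with a three-nearest-neighbor vote in the projected space.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_pca(X, n_components):
    """Return the mean and the top principal directions (as columns) of X."""
    mean = X.mean(axis=0)
    # Rows of Vt are principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components].T

def outlier_scores(X, mean, V):
    """Reconstruction error of each row after projection onto the PCA subspace."""
    Xc = X - mean
    residual = Xc - Xc @ V @ V.T
    return np.linalg.norm(residual, axis=1)

def fisher_direction(X, y, ridge=1e-6):
    """Two-class Fisher discriminant direction w = Sw^{-1} (m1 - m0)."""
    X0, X1 = X[y == 0], X[y == 1]
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    Sw = (X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)
    Sw += ridge * np.eye(X.shape[1])  # small ridge keeps Sw invertible
    return np.linalg.solve(Sw, m1 - m0)

def knn_predict(train_z, train_y, test_z, k=3):
    """Majority vote among the k nearest training points on the 1-D projection."""
    dists = np.abs(test_z[:, None] - train_z[None, :])
    nearest = np.argsort(dists, axis=1)[:, :k]
    return (train_y[nearest].sum(axis=1) * 2 > k).astype(int)

# --- Approach 1: idioms as outliers (synthetic stand-in data) ---
# Hypothetical setup: "literal" vectors lie near a 2-D subspace of a 10-D space,
# while "idiom" vectors are spread in all directions.
basis = rng.normal(size=(2, 10))
literal_train = rng.normal(size=(150, 2)) @ basis + 0.05 * rng.normal(size=(150, 10))
literal_test = rng.normal(size=(50, 2)) @ basis + 0.05 * rng.normal(size=(50, 10))
idiom_test = 3.0 * rng.normal(size=(20, 10))

mean, V = fit_pca(literal_train, n_components=2)
literal_scores = outlier_scores(literal_test, mean, V)
idiom_scores = outlier_scores(idiom_test, mean, V)

# --- Approach 2: discriminant subspace followed by 3-nearest-neighbor ---
def make_classes(n):
    """Two Gaussian classes with shifted means (label 1 plays the idiom class)."""
    X = np.vstack([rng.normal(size=(n, 10)), rng.normal(size=(n, 10)) + 1.5])
    y = np.array([0] * n + [1] * n)
    return X, y

X_train, y_train = make_classes(100)
X_test, y_test = make_classes(50)

w = fisher_direction(X_train, y_train)
predictions = knn_predict(X_train @ w, y_train, X_test @ w, k=3)
accuracy = float((predictions == y_test).mean())

print("mean outlier score, literal:", literal_scores.mean())
print("mean outlier score, idiom:  ", idiom_scores.mean())
print("3-NN accuracy in discriminant subspace:", accuracy)
```

The contrast between the two halves mirrors the one drawn in the abstract: the outlier detector never sees labels and only needs examples of ordinary text, whereas the discriminant projection requires labeled sentences but can focus on the directions that separate the two classes.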
