A Continuum-Based Approach for Tightness Analysis of Chinese Semantic Units

Chinese semantic units fall along a continuum of connection tightness, ranging from very tight, non-compositional expressions, through tight compositional words, then phrases, to loose, more or less arbitrary combinations of words. We propose an approach to measuring connection tightness along this continuum, based on the document frequency of segmentation patterns in a reference corpus. We investigate a variety of corpora, including search engine snippets, search engine results derived from query logs, and standard corpora. Our tightness ranking of 300 phrases is quite close to their manual ranking, and non-compositional compound extraction achieves a precision as high as 94.3% on the top 1,000 4-grams extracted from the Chinese Gigaword corpus.
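
To make the idea of scoring tightness from the document frequency of segmentation patterns concrete, the following is a minimal sketch, not the paper's actual measure. It assumes each reference document is already segmented into a list of tokens, and it uses a hypothetical ratio: the number of documents in which the unit appears as a single token, divided by the number of documents in which it appears either whole or split across adjacent tokens. The function names, the toy corpus, and the ratio itself are illustrative assumptions.

```python
from typing import List, Sequence


def occurs_whole(tokens: Sequence[str], unit: str) -> bool:
    """True if `unit` appears as one unsplit token in the document."""
    return unit in tokens


def occurs_split(tokens: Sequence[str], unit: str) -> bool:
    """True if `unit` is spelled out by two or more adjacent tokens."""
    n = len(tokens)
    for start in range(n):
        joined = ""
        for end in range(start, n):
            joined += tokens[end]
            if not unit.startswith(joined):
                break  # this run of tokens can no longer spell the unit
            if joined == unit:
                if end > start:
                    return True  # spelled by at least two tokens
                break  # a single token equal to the unit is not a split
    return False


def tightness(docs: List[Sequence[str]], unit: str) -> float:
    """Hypothetical score in [0, 1]: df(whole) / (df(whole) + df(split)).

    This ratio is an assumption made for illustration; the source only
    states that the measure is based on document frequency of
    segmentation patterns.
    """
    whole = sum(1 for d in docs if occurs_whole(d, unit))
    split = sum(1 for d in docs if occurs_split(d, unit))
    total = whole + split
    return whole / total if total else 0.0


# Toy pre-segmented corpus (invented for this example).
docs = [
    ["我", "喜欢", "机器学习"],
    ["机器学习", "很", "有趣"],
    ["机器", "学习", "语言"],
    ["他", "买", "了", "红", "苹果"],
    ["红", "苹果", "好吃"],
    ["红苹果", "上市"],
]

for unit in ["机器学习", "红苹果"]:
    print(unit, round(tightness(docs, unit), 3))
```

Under this sketch, a unit that the segmenter usually keeps whole scores near 1, while a unit that is usually split scores near 0, which gives one simple way to place candidates along the continuum and rank them.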
