We present a system that extracts word-based bigram and n-gram collocation information from a 60MB corpus and then locates bigram pairs using strength and spread as defined in the Xtract system. For Xtract to work effectively with Chinese, we have readjusted its parameters. To obtain a higher recall rate, we have modified the algorithm to identify collocations with a low frequency of occurrence, a method that works particularly well for bigrams in which one word is high-frequency and the other low-frequency. In preliminary experiments, our system extracts bigram collocations with a precision of 61%, an 8% improvement over the direct use of Smadja's Xtract on Chinese. Further, we have improved the recall rate by 4.5% while extracting multiword collocations with 92% precision.
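To make the two Xtract statistics concrete, the following is a minimal sketch of how strength and spread could be computed for the collocates of one headword. It follows the general definitions in Smadja's Xtract (strength as the z-score of a collocate's co-occurrence frequency, spread as the variance of its counts over the ten positions within five words on either side of the headword). The function names, the use of the population standard deviation, and the pre-segmented token lists are assumptions for illustration. The readjusted Chinese parameters and the low-frequency modification described above are not reproduced here.

```python
from collections import defaultdict
from statistics import mean, pstdev

WINDOW = 5  # relative positions -5..-1 and +1..+5, as in Xtract


def position_counts(sentences, headword, window=WINDOW):
    """Count how often each collocate appears at each offset from headword.

    `sentences` is an iterable of token lists (assumed already word-segmented).
    """
    offsets = [d for d in range(-window, window + 1) if d != 0]
    counts = defaultdict(lambda: dict.fromkeys(offsets, 0))
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok != headword:
                continue
            for d in offsets:
                j = i + d
                if 0 <= j < len(tokens):
                    counts[tokens[j]][d] += 1
    return counts


def strength_and_spread(counts):
    """Return {collocate: (strength, spread)} for one headword.

    strength = (freq - mean freq) / std of freq over all collocates
    spread   = variance of the collocate's counts across the positions
    Population standard deviation is an assumption of this sketch.
    """
    totals = {w: sum(pos.values()) for w, pos in counts.items()}
    freqs = list(totals.values())
    f_bar = mean(freqs)
    sigma = pstdev(freqs)
    result = {}
    for w, pos in counts.items():
        strength = (totals[w] - f_bar) / sigma if sigma > 0 else 0.0
        p_bar = totals[w] / len(pos)
        spread = sum((c - p_bar) ** 2 for c in pos.values()) / len(pos)
        result[w] = (strength, spread)
    return result


if __name__ == "__main__":
    # Toy, hypothetical corpus; real input would be segmented Chinese text.
    corpus = [["a", "b", "c"], ["a", "b", "d"], ["x", "a", "b"]]
    for word, (k, u) in strength_and_spread(position_counts(corpus, "a")).items():
        print(word, round(k, 3), round(u, 3))
```

A candidate bigram would then be kept only if both values exceed chosen thresholds; those thresholds are exactly the parameters that need tuning per language, so they are left to the caller in this sketch.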
[1] George A. Miller, et al. Introduction to WordNet: An On-line Lexical Database, 1990.
[2] Yaacov Choueka, et al. Looking for Needles in a Haystack or Locating Interesting Collocational Expressions in Large Textual Databases, 1988, RIAO Conference.
[3] Kenneth Ward Church, et al. Word Association Norms, Mutual Information, and Lexicography, 1989, ACL.
[4] M. Benson, et al. Collocations and General-purpose Dictionaries, 1990.
[5] Frank Smadja, et al. Retrieving Collocations from Text: Xtract, 1993, CL.