A new collocation extraction method combining multiple association measures

As an important linguistic resource, collocations capture significant relations between words. Automatic collocation extraction is important for many natural language processing applications, such as word sense disambiguation, machine translation, and information retrieval. Traditional collocation extraction approaches rely on a single statistical measure, which may be suboptimal because they cannot take advantage of the evidence provided by multiple measures. In this paper, we propose a logistic linear regression model (LLRM) that combines five classical lexical association measures: the χ²-test, the t-test, co-occurrence frequency, the log-likelihood ratio, and mutual information. Experiments show that our approach yields a significant improvement in both precision and recall over each of the individual measures.
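
To make the combination concrete, the following minimal sketch computes the five association measures for one candidate word pair from raw corpus counts and passes them through a logistic function of their weighted sum. It is an illustration under stated assumptions, not the model as specified in the paper: the function names `association_measures` and `combine`, the textbook formulations of each measure (the t-score uses the common approximation of the sample variance by the pair's relative frequency), and the absence of any feature scaling are all choices made here. The weights are placeholders that would have to be learned from labelled collocation data.

```python
import math


def association_measures(c12, c1, c2, n):
    """Five association scores for a word pair (w1, w2).

    c12: co-occurrence count of the pair, c1 and c2: counts of w1 and w2,
    n: total number of pairs in the corpus.
    Assumes c12 > 0 and that no marginal total of the table is zero.
    """
    # Observed 2x2 contingency table.
    o11 = c12                  # w1 followed by w2
    o12 = c1 - c12             # w1 followed by another word
    o21 = c2 - c12             # another word followed by w2
    o22 = n - c1 - c2 + c12    # neither
    observed = [[o11, o12], [o21, o22]]
    rows = [o11 + o12, o21 + o22]
    cols = [o11 + o21, o12 + o22]
    # Expected counts under independence.
    expected = [[rows[i] * cols[j] / n for j in range(2)] for i in range(2)]

    chi2 = n * (o11 * o22 - o12 * o21) ** 2 / (rows[0] * rows[1] * cols[0] * cols[1])
    # t-score with the sample variance approximated by the pair's relative frequency.
    t_score = (o11 - expected[0][0]) / math.sqrt(o11)
    # Log-likelihood ratio (G^2); empty cells contribute zero.
    llr = 2.0 * sum(
        observed[i][j] * math.log(observed[i][j] / expected[i][j])
        for i in range(2)
        for j in range(2)
        if observed[i][j] > 0
    )
    # Pointwise mutual information in bits.
    pmi = math.log2(o11 / expected[0][0])
    return {"chi2": chi2, "t": t_score, "freq": float(o11), "llr": llr, "mi": pmi}


def combine(measures, weights, bias=0.0):
    """Logistic function of a weighted sum of the measures.

    `weights` and `bias` are hypothetical parameters; in practice they would
    be estimated from candidates labelled as collocation / non-collocation.
    """
    z = bias + sum(weights[name] * value for name, value in measures.items())
    # Numerically stable sigmoid.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)
```

A candidate list would then be ranked by the value returned from `combine`, with a threshold on that value deciding which pairs are reported as collocations.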