An Extensive Empirical Study of Collocation Extraction Methods

This paper reports the current state of an ongoing study of collocations, an essential linguistic phenomenon with a wide range of applications in natural language processing. The core of the work is an empirical evaluation of a comprehensive list of automatic collocation extraction methods using precision-recall measures, together with a proposal of a new approach integrating multiple basic methods with statistical classification. We demonstrate that combining multiple independent techniques yields a significant performance improvement over the individual basic methods.
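As an illustration of what a precision-recall evaluation of an extraction method can look like, the following minimal sketch ranks bigram candidates by one association measure and computes precision and recall at each rank against a set of reference collocations. Pointwise mutual information is used here only as a familiar example of a basic method; the counts, the reference set, and the function names are hypothetical and are not taken from the paper.

```python
import math

def pmi(f_xy, f_x, f_y, n):
    """Pointwise mutual information of a bigram, computed from raw counts."""
    return math.log2((f_xy * n) / (f_x * f_y))

def precision_recall_curve(ranked, gold):
    """Return (precision, recall) after each position of a ranked candidate list."""
    points, hits = [], 0
    for k, candidate in enumerate(ranked, start=1):
        if candidate in gold:
            hits += 1
        points.append((hits / k, hits / len(gold)))
    return points

# Hypothetical toy data: bigram -> (f_xy, f_x, f_y); n is the total bigram count.
n = 1000
counts = {
    ("strong", "tea"): (8, 20, 10),
    ("of", "the"): (30, 200, 300),
    ("red", "tape"): (5, 10, 6),
    ("the", "tea"): (2, 300, 10),
}
# Hypothetical reference set of true collocations.
gold = {("strong", "tea"), ("red", "tape")}

# Score every candidate, then rank from strongest to weakest association.
scores = {bigram: pmi(*c, n) for bigram, c in counts.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
curve = precision_recall_curve(ranked, gold)

for bigram, (p, r) in zip(ranked, curve):
    print(bigram, f"precision={p:.2f}", f"recall={r:.2f}")
```

Running the same procedure with a different scoring function in place of `pmi` gives one curve per method, which is what allows methods to be compared on equal terms.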
