Collocation Extraction using Parallel Corpus

This paper presents a novel method for extracting Persian collocations using a parallel corpus. The method applies whenever a parallel corpus is available between the target language and a high-resource language. It does not require an accurate parser for the target language, yet it parses the target sentences in order to capture long-distance collocations and produce more precise results. Training data built by bootstrapping is then used to rank the candidates with a log-linear model. Compared with the window-based statistical method, the proposed method improves the precision and recall of collocation extraction by 5 and 3 percent respectively, where a candidate is judged on whether it is a Persian multi-word expression.
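For orientation, a window-based statistical baseline of the kind the abstract compares against can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say which association measure the baseline uses, so pointwise mutual information (PMI) is an assumption here, as are the function name `window_pmi`, the `window` parameter, and the toy English input.

```python
import math
from collections import Counter


def window_pmi(sentences, window=3):
    """Score ordered word pairs that co-occur within `window` tokens by PMI.

    `sentences` is an iterable of token lists. The choice of PMI and the
    window size are illustrative assumptions, not taken from the paper.
    """
    word_counts = Counter()
    pair_counts = Counter()
    for tokens in sentences:
        word_counts.update(tokens)
        for i, left in enumerate(tokens):
            # Pair each token with the next `window` tokens to its right.
            for right in tokens[i + 1 : i + 1 + window]:
                pair_counts[(left, right)] += 1

    total_words = sum(word_counts.values())
    total_pairs = sum(pair_counts.values())

    scores = {}
    for (left, right), n in pair_counts.items():
        p_pair = n / total_pairs
        p_left = word_counts[left] / total_words
        p_right = word_counts[right] / total_words
        scores[(left, right)] = math.log2(p_pair / (p_left * p_right))
    return scores


# Toy usage: rank candidate pairs from highest to lowest PMI.
corpus = [["strong", "tea"], ["strong", "tea"], ["weak", "coffee"]]
ranked = sorted(window_pmi(corpus).items(), key=lambda kv: kv[1], reverse=True)
```

Because such a baseline only sees words inside a fixed window, it cannot pair words that are syntactically related but far apart in the sentence, which is the gap the abstract says parsing is meant to address.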
