Two-Word Collocation Extraction Using Monolingual Word Alignment Method

Statistical bilingual word alignment has been well studied in machine translation. This article adapts the bilingual word alignment algorithm to a monolingual scenario in order to extract collocations from a monolingual corpus, based on the observation that the words in a collocation tend to co-occur in similar contexts, as aligned words do in bilingual word alignment. First, the monolingual corpus is replicated to generate a parallel corpus in which each sentence pair consists of two identical sentences. Next, a monolingual word alignment algorithm is employed to align potentially collocated words. Finally, the aligned word pairs are ranked by their alignment scores, and the highest-scoring candidates are extracted as collocations. We conducted experiments on Chinese and English corpora. Compared to previous approaches that use association measures to extract collocations from co-occurring word pairs within a given window, our method achieves higher precision and recall. According to human evaluation, our method achieves a precision of 62% on a Chinese corpus and 64% on an English corpus. In particular, it can extract collocations with longer spans, achieving a higher precision of 83% on long-span (> 6 words) Chinese collocations.
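
The three steps described above (replicate, align, rank) can be illustrated with a minimal sketch. The abstract does not say which alignment model or scoring formula is used, so the code below makes two loudly labeled assumptions: it substitutes a simplified IBM Model 1-style EM procedure in which a token is forbidden from aligning to its own position, and it scores a pair by averaging the two directional alignment probabilities. The function names `train_monolingual_model1` and `rank_pairs` are hypothetical and chosen only for this illustration.

```python
from collections import defaultdict


def train_monolingual_model1(sentences, iterations=10):
    """Simplified IBM Model 1-style EM on a corpus paired with itself.

    ASSUMPTION: this stands in for the (unspecified) alignment model.
    Each sentence acts as both source and target of a sentence pair; a
    token at position j may align to any position i != j of the same
    sentence. Returns t[(src, tgt)] = p(tgt | src).
    """
    # The first E-step sees uniform (unnormalised) translation scores.
    t = defaultdict(lambda: 1.0)
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        for sent in sentences:
            for j, tgt in enumerate(sent):
                # Candidate sources: every token except the one at position j.
                sources = [src for i, src in enumerate(sent) if i != j]
                norm = sum(t[(src, tgt)] for src in sources)
                if norm == 0.0:
                    continue  # one-word sentence: nothing to align to
                for src in sources:
                    frac = t[(src, tgt)] / norm  # expected alignment count
                    count[(src, tgt)] += frac
                    total[src] += frac
        # M-step: renormalise expected counts per source word.
        t = defaultdict(
            float, {(s, g): c / total[s] for (s, g), c in count.items()}
        )
    return t


def rank_pairs(t):
    """Rank unordered word pairs, best first.

    ASSUMPTION: the score is the mean of the two directional probabilities.
    """
    scores = {}
    for a, b in list(t):
        if a == b:
            continue  # a repeated word is not a two-word collocation
        key = tuple(sorted((a, b)))
        if key not in scores:
            scores[key] = 0.5 * (t.get((a, b), 0.0) + t.get((b, a), 0.0))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    corpus = [
        ["we", "make", "a", "decision"],
        ["they", "make", "a", "difficult", "decision"],
        ["we", "drink", "strong", "tea"],
    ]
    model = train_monolingual_model1(corpus)
    for pair, score in rank_pairs(model)[:5]:
        print(pair, round(score, 3))
```

Because candidate positions cover the whole sentence rather than a fixed window, a sketch of this kind can in principle pair words that are far apart, which is consistent with the long-span result reported above. A realistic system would also need word segmentation for Chinese and some frequency filtering, neither of which is shown here.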
