Bilingual Collocation Extraction Based on Syntactic and Statistical Analyses

In this paper, we describe an algorithm that employs syntactic and statistical analysis to extract bilingual collocations from a parallel corpus. Collocations are pervasive in all types of writing and can be found in phrases, chunks, proper names, idioms, and terminology. Automatic extraction of monolingual and bilingual collocations is therefore important for many applications, including natural language generation, word sense disambiguation, machine translation, lexicography, and cross-language information retrieval. Collocations can be classified as lexical or grammatical. A lexical collocation holds between content words, while a grammatical collocation holds between a content word and function words or a syntactic structure. In addition, bilingual collocations can be rigid or flexible in both languages. In a rigid collocation, the words must appear next to each other; in a flexible (elastic) collocation, they need not. This paper focuses on extracting rigid lexical bilingual collocations. In our method, the preferred syntactic patterns are obtained from idioms and collocations in a machine-readable dictionary. Collocations matching these patterns are extracted from aligned sentences in a parallel corpus. For sentence alignment, we use a new method based on punctuation statistics, which is found to outperform the length-based approach, with precision rates approaching 98%. The extracted collocations are then matched across languages based on cross-linguistic statistical association. Association is measured both between whole collocations and between the words within them, and is used to link each collocation with its counterpart in the other language. We implemented the proposed method on a very large Chinese-English parallel corpus and obtained satisfactory results.
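The linking step described above can be illustrated with a minimal sketch. The abstract does not name the association measure, so the Dice coefficient here is an assumption chosen only for illustration, as are the function name `link_collocations`, the input layout, and the toy English-Chinese data. The sketch covers association between whole collocations only and leaves out the word-level association that the method also uses.

```python
from collections import Counter
from itertools import product


def link_collocations(aligned_pairs):
    """Link each source collocation to its best-scoring target collocation.

    aligned_pairs: iterable of (source_collocations, target_collocations),
    one entry per aligned sentence pair. Hypothetical input layout.
    Returns {source_collocation: (target_collocation, dice_score)}.
    """
    src_count = Counter()  # sentence pairs containing each source collocation
    tgt_count = Counter()  # sentence pairs containing each target collocation
    joint = Counter()      # sentence pairs containing both members of a pair

    for src_cols, tgt_cols in aligned_pairs:
        src_cols, tgt_cols = set(src_cols), set(tgt_cols)
        src_count.update(src_cols)
        tgt_count.update(tgt_cols)
        joint.update(product(src_cols, tgt_cols))

    best = {}
    for (s, t), n in joint.items():
        # Dice coefficient (assumed measure): 2 * joint / (count_s + count_t)
        score = 2 * n / (src_count[s] + tgt_count[t])
        if s not in best or score > best[s][1]:
            best[s] = (t, score)
    return best


# Toy data, invented for illustration only.
pairs = [
    ({"heavy rain"}, {"大雨"}),
    ({"heavy rain", "strong tea"}, {"大雨", "浓茶"}),
    ({"strong tea"}, {"浓茶"}),
]
links = link_collocations(pairs)
```

In this toy run, each English collocation co-occurs with its intended counterpart in every sentence pair where it appears, so that pairing receives the highest score. A real system would also need a score threshold and a minimum frequency to filter chance co-occurrences.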
