A Three-Layered Collocation Extraction Tool and Its Application in China English Studies

We design a three-layered collocation extraction tool that integrates syntactic and semantic knowledge, and apply it to China English studies. The tool first extracts peripheral collocations in the frequency layer from dependency triples, then extracts semi-peripheral collocations in the syntactic layer using association measures, and finally extracts core collocations in the semantic layer using a thesaurus of similar words. The syntactic constraints filter out much of the noise from surface co-occurrences, and the semantic constraints are effective in identifying the “core” collocations. We apply the tool to automatically extract collocations from a large corpus of China English that we compiled, in order to explore how China English, as a variety of English, is nativized. We then analyze the similarities and differences among the typical China English collocations of a group of verbs. The tool and the results can be applied in compiling language resources for Chinese-English translation and in corpus-based China studies.
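The three layers can be illustrated with a minimal Python sketch. It is not the tool itself, and several details are assumptions the abstract does not specify: the function name `extract_layers`, the thresholds `min_freq` and `min_pmi`, the use of pointwise mutual information as the association measure, and the substitution test in the semantic layer (a candidate counts as "core" only if no thesaurus neighbour of the dependent forms a frequent triple with the same head and relation). The sample triples and thesaurus are invented for illustration.

```python
import math
from collections import Counter


def extract_layers(triples, thesaurus, min_freq=2, min_pmi=0.5):
    """Return (peripheral, semi_peripheral, core) sets of dependency triples.

    triples   : iterable of (head, relation, dependent) tuples
    thesaurus : dict mapping a word to a list of similar words (hypothetical format)
    """
    counts = Counter(triples)
    total = sum(counts.values())

    # Marginal counts needed for the association measure.
    head_counts = Counter()
    dep_counts = Counter()
    for (head, rel, dep), n in counts.items():
        head_counts[(head, rel)] += n
        dep_counts[(rel, dep)] += n

    # Frequency layer: keep triples seen at least min_freq times.
    peripheral = {t for t, n in counts.items() if n >= min_freq}

    def pmi(triple):
        # Pointwise mutual information, an assumed association measure.
        head, rel, dep = triple
        joint = counts[triple] * total
        return math.log2(joint / (head_counts[(head, rel)] * dep_counts[(rel, dep)]))

    # Association-measure layer: keep triples scoring at least min_pmi.
    semi_peripheral = {t for t in peripheral if pmi(t) >= min_pmi}

    def has_frequent_substitute(triple):
        # True if a similar word can replace the dependent in a frequent triple.
        head, rel, dep = triple
        return any((head, rel, alt) in peripheral for alt in thesaurus.get(dep, []))

    # Semantic layer: keep triples whose dependent is not freely substitutable.
    core = {t for t in semi_peripheral if not has_frequent_substitute(t)}
    return peripheral, semi_peripheral, core


# Invented sample data: verb-object triples and a tiny similar-word thesaurus.
sample_triples = (
    [("make", "dobj", "decision")] * 3
    + [("eat", "dobj", "apple")] * 3
    + [("eat", "dobj", "pear")] * 3
    + [("see", "dobj", "apple")] * 2
    + [("see", "dobj", "decision"), ("see", "dobj", "pear")]
)
sample_thesaurus = {"decision": ["choice"], "apple": ["pear"], "pear": ["apple"]}

peripheral, semi_peripheral, core = extract_layers(sample_triples, sample_thesaurus)
print(sorted(core))
```

In this toy data, "eat pear" passes the association threshold but is excluded from the core set because "eat apple" is also frequent, whereas "make decision" has no frequent substitute. A real system would obtain the triples from a dependency parser and would likely use a stronger association measure such as the log-likelihood ratio.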
