Syntactic concordancing and multi-word expression detection

Concordancers are tools that display the contexts in which a given word occurs in a corpus. Also known as key word in context (KWIC) tools, they are now indispensable to lexicographers, linguists, and translators. We present an enhanced concordancer that integrates syntactic information on sentence structure with statistical information on word cooccurrence in order to detect and display the context words most strongly related to the word under investigation. The tool considerably eases the user's task by highlighting syntactically well-formed word combinations that are likely to form complex lexical units, i.e., multi-word expressions. A key distinctive feature of the tool is its multilingualism: syntax-based multi-word expression detection is available for multiple languages, and, when multilingual parallel corpora are available, parallel concordancing lets users consult the version of a source context in another language. In this article, we describe the system's underlying methodology and resources, its architecture, and its recently developed online version. We also report performance evaluation results for the main system components, focusing on the comparison between syntax-based and syntax-free approaches.
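
To make the notion of a concordance concrete, the following is a minimal sketch of a plain, syntax-free KWIC display with raw context counts. It is not the system described in this article: the function names `kwic` and `context_counts`, the regex tokenizer, and the fixed `window` parameter are all illustrative assumptions, and the tool presented here replaces the fixed window with parser output and the raw counts with association scores.

```python
import re
from collections import Counter


def kwic(tokens, keyword, window=4):
    """Return (left context, hit, right context) for each occurrence of keyword."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            hits.append((" ".join(left), tok, " ".join(right)))
    return hits


def context_counts(tokens, keyword, window=4):
    """Count lowercased words seen within `window` tokens of the keyword.

    This is a syntax-free baseline: proximity stands in for a real
    syntactic relation, so unrelated neighbours are counted as well.
    """
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            span = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            counts.update(w.lower() for w in span)
    return counts


# Hypothetical two-sentence corpus, tokenized naively on word characters.
text = "They will take a decision soon. The decision was taken by the board."
tokens = re.findall(r"\w+", text)

for left, hit, right in kwic(tokens, "decision", window=3):
    print(f"{left:>20} [{hit}] {right}")
```

In this sketch the window crosses the sentence boundary and counts function words such as "the" alongside the verb "take"/"taken", which is the kind of noise that syntactic filtering is meant to reduce.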
