Corpus tools for lexicographers

To analyse corpus data, lexicographers need a 'corpus tool': software that allows them to search, manipulate and save data. A good corpus tool is the key to a comprehensive lexicographic analysis; a corpus without a good tool to access it is of little use.

Both corpus compilation and corpus tools have been swept along by general technological advances over the last three decades. Compiling and storing corpora has become far faster and easier, so corpora tend to be much larger than they used to be. Most of the first COBUILD dictionary was produced from a corpus of eight million words. Several of the leading English dictionaries of the 1990s were produced using the British National Corpus (BNC), of 100 million words. Current lexicographic projects we are involved in use corpora of around a billion words, though this is still less than one hundredth of one percent of the English-language text available on the Web (see Rundell, this volume).

The amount of data to analyse has thus increased significantly, and corpus tools have had to improve to help lexicographers adapt to this change. They have become faster, more multifunctional, and more customizable. In the COBUILD project, getting concordance output took a long time, and the concordances were then printed on paper and handed out to lexicographers (Clear 1987). Today, with Google as a point of comparison, concordancing needs to be instantaneous, with the analysis taking place on the computer screen. Moreover, larger corpora offer much higher numbers of concordance lines per word (especially for high-frequency words), and, given the time constraints lexicographers work under (see Rundell, this volume), new data summarization features are required to ease and speed up the analysis.

In this chapter, we review the functionality of corpus tools used by lexicographers. In Section 3.2, we discuss the procedures in corpus preparation that are required for some of these features to work. In Section 3.3, we briefly describe some leading tools.
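
For readers unfamiliar with concordances, the following minimal Python sketch shows the basic idea of a keyword-in-context (KWIC) display: every occurrence of a search word is listed with a fixed window of text on either side, aligned so that the keyword sits in one column. It is an illustration only and does not reflect how any tool discussed in this chapter is implemented; the function name `concordance`, the `width` parameter, and the sample text are assumptions made for the example. It also assumes a small, single-line text held in memory, whereas real corpus tools rely on pre-built indexes to stay fast on very large corpora.

```python
import re


def concordance(text, keyword, width=30):
    """Return one KWIC line per whole-word, case-insensitive match of keyword."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        # Take up to `width` characters of context on each side of the match.
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context and left-align the right context so
        # that the keyword falls in the same column on every line.
        lines.append(f"{left:>{width}} [{match.group(0)}] {right:<{width}}")
    return lines


# Hypothetical sample text, used only to demonstrate the output layout.
sample = (
    "A corpus without a good tool is of little use. "
    "Each corpus tool must search a large corpus quickly."
)

for line in concordance(sample, "corpus", width=20):
    print(line)
```

Scanning the raw text with a regular expression is adequate for a few sentences, but it takes time proportional to the size of the corpus on every query, so it would not scale to the corpus sizes described above.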
