Distinctive words in academic writing: a comparison of three statistical tests for keyword extraction

Most keyword analyses rely on the log-likelihood ratio or the chi-square test to extract words that are particularly characteristic of a corpus (e.g. Scott & Tribble 2006). These measures are computed from absolute frequencies, that is, from a word's total count in each corpus as a whole, and therefore cannot account for the fact that "corpora are inherently variable internally" (Gries 2007). To overcome this limitation, measures of dispersion are sometimes used in combination with keyness values (e.g. Rayson 2003; Oakes & Farrow 2007). Some scholars have also suggested using other statistical measures (e.g. the Wilcoxon-Mann-Whitney test), but these techniques have not yet gained favour among corpus linguists. One possible explanation for this lack of enthusiasm is that statistical tests for keyword extraction have rarely been compared. In this article, we apply the log-likelihood ratio, the t-test and the Wilcoxon-Mann-Whitney test in turn to compare the academic and fiction sub-corpora of the British National Corpus and to extract words that are typical of academic discourse. We compare the three resulting lists of academic keywords on a number of criteria (e.g. the number of keywords extracted by each measure, the percentage of keywords shared by the three lists, and the frequency and distribution of academic keywords in the two corpora) and explore the specificities of the three statistical measures. We also assess the advantages and disadvantages of these measures for the extraction of general academic words.
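The contrast between the three measures can be illustrated with a minimal Python sketch. It is not the procedure used in the article: the per-text counts are invented toy data, the function names are hypothetical, the log-likelihood is the common two-term form computed from corpus totals, the t statistic is Welch's unequal-variance variant, and the Wilcoxon-Mann-Whitney test is reduced to its U statistic without a p-value.

```python
import math
from statistics import mean, variance


def log_likelihood(freq_a, freq_b, size_a, size_b):
    """Two-term log-likelihood ratio from absolute corpus frequencies."""
    total = freq_a + freq_b
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    ll = 0.0
    for observed, expected in ((freq_a, expected_a), (freq_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            ll += observed * math.log(observed / expected)
    return 2 * ll


def welch_t(sample_a, sample_b):
    """Welch's t statistic computed on per-text frequencies."""
    standard_error = math.sqrt(
        variance(sample_a) / len(sample_a) + variance(sample_b) / len(sample_b)
    )
    return (mean(sample_a) - mean(sample_b)) / standard_error


def mann_whitney_u(sample_a, sample_b):
    """U statistic for sample_a: pairs where a > b count 1, ties count 0.5."""
    u = 0.0
    for a in sample_a:
        for b in sample_b:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u


# Invented toy data: occurrences of one word per 1,000-word text.
# In corpus A the word is concentrated in a single text;
# in corpus B it is rare but evenly spread.
texts_a = [50, 0, 0, 0, 0]
texts_b = [1, 1, 1, 1, 1]
WORDS_PER_TEXT = 1000  # assumption: all texts have the same length

ll = log_likelihood(
    sum(texts_a), sum(texts_b),
    WORDS_PER_TEXT * len(texts_a), WORDS_PER_TEXT * len(texts_b),
)
t = welch_t(texts_a, texts_b)
u = mann_whitney_u(texts_a, texts_b)

print(f"log-likelihood = {ll:.2f}")
print(f"Welch t        = {t:.2f}")
print(f"U (corpus A)   = {u:.1f} of {len(texts_a) * len(texts_b)} pairs")
```

On this toy data the log-likelihood value is about 42.7, far above the 3.84 threshold usually taken for p < 0.05 with one degree of freedom, so the word would be flagged as key for corpus A. The two text-level measures see the uneven distribution: the t statistic is only 0.9, and U is 5 out of 25 pairs, meaning that in most pairwise comparisons a text from corpus A uses the word less often than a text from corpus B.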

[1]  J. Pennebaker,et al.  Psychological aspects of natural language use: our words, our selves. , 2003, Annual review of psychology.

[2]  J. Pennebaker,et al.  Language use of depressed and depression-vulnerable college students , 2004 .

[3]  Tony McEnery,et al.  Corpus-Based Language Studies: An Advanced Resource Book , 2006 .

[4]  Masao Utiyama,et al.  Selecting level-specific specialized vocabulary using statistical measures , 2006 .

[5]  John Sinclair,et al.  Corpus, Concordance, Collocation , 1991 .

[6]  Stefan Evert,et al.  The Statistics of Word Cooccurrences: Word Pairs and Collocations , 2004 .

[7]  Michael Halliday,et al.  Cohesion in English , 1976 .

[8]  Adam Kilgarriff,et al.  Language is never, ever, ever, random , 2005 .

[9]  Slava M. Katz Distribution of content words and phrases in text and language modelling , 1996, Natural Language Engineering.

[10]  Toni C. M. Rietveld,et al.  Pitfalls in Corpus Research , 2004, Comput. Humanit..

[11]  G. Leech,et al.  Social differentiation in the use of English vocabulary: some analyses of the conversational component of the British National Corpus , 1997 .

[12]  D Noel Patterns and meanings: Using corpora for English language research and teaching. By ALAN PARTINGTON. (Studies in corpus linguistics 2.) Amsterdam & Philadelphia: John Benjamins, 1998 , 2002 .

[13]  Donald P. Spence Lawfulness In Lexical Choice: A Natural Experiment , 1980, Journal of the American Psychoanalytic Association.

[14]  I. S. P. Nation,et al.  Learning Vocabulary in Another Language: Appendixes , 2001 .

[15]  Stefan Th. Gries Null-hypothesis significance testing of word frequencies: a follow-up on Kilgarriff , 2005 .

[16]  Sylviane Granger,et al.  Lexical verbs in academic discourse: a corpus-driven study of learner use , 2009 .

[17]  Alejandro Curado Fuentes Lexical Behaviour in Academic and Technical Corpora: Implications for ESP Development. , 2001 .

[18]  John M. Swales,et al.  Genre Analysis: English in Academic and Research Settings , 1993 .

[19]  Paul Georg Meyer Coming to know: Studies in the lexical semantics and pragmatics of academic English , 1997 .

[20]  P. Schnurr,et al.  Diagnostic classification through content analysis of patients' speech. , 1988, The American journal of psychiatry.

[21]  G. Leech,et al.  Word Frequencies in Written and Spoken English: based on the British National Corpus , 2001 .

[22]  Chu-Ren Huang,et al.  Distributional Consistency: As a General Method for Defining a Core Lexicon , 2004, LREC.

[23]  L. Burnard,et al.  Genres, keywords, teaching: towards a pedagogic account of the language of project proposals , 2000 .

[24]  Stefan Th. Gries,et al.  Exploring variability within and between corpora: some methodological considerations , 2006 .

[25]  A. Kilgarriff Comparing Corpora , 2001 .

[26]  A. M. Martin Teaching Academic Vocabulary to Foreign Graduate Students , 1976 .

[27]  Donald E. Hardy Textual Patterns: Key Words and Corpus Analysis in Language Education , 2007 .

[28]  S. Ziebland,et al.  Gender, cancer experience and internet use: a comparative keyword analysis of interviews and online cancer support groups. , 2006, Social science & medicine.

[29]  Anthony McEnery,et al.  Rethinking Language Pedagogy from a Corpus Perspective: Papers from the Third International Conference on Teaching and Language Corpora , 2000 .

[30]  Michael Oakes,et al.  Statistics for Corpus Linguistics , 1998 .

[31]  S. De Cock Patterns and Meanings: Using Corpora for English Language Research and Teaching , 2001 .

[32]  Ted Dunning,et al.  Accurate Methods for the Statistics of Surprise and Coincidence , 1993, CL.

[33]  David Y. W. Lee,et al.  Genres, Registers, Text Types, Domains and Styles: Clarifying the Concepts and Navigating a Path through the BNC Jungle , 2001 .

[34]  S. Siegel,et al.  Nonparametric Statistics for the Behavioral Sciences , 2022, The SAGE Encyclopedia of Research Design.

[35]  Amir Zeldes Tony McEnery, Richard Xiao & Yukio Tono. 2006. Corpus-Based Language Studies. An Advanced Resource Book (Routledge Applied Linguistics). London, New York: Routledge. xx, 386 S , 2010 .

[36]  H. Scarborough,et al.  Lexical correlates of cervical cancer. , 1978, Social science & medicine.

[37]  Stefan Evert,et al.  Corpora and collocations , 2007 .

[38]  Paul Edward Rayson,et al.  Matrix: a statistical method and software tool for linguistic analysis through corpus comparison , 2003 .

[39]  R. Bergmann,et al.  Different Outcomes of the Wilcoxon—Mann—Whitney Test from Different Statistics Packages , 2000 .

[40]  S. Gries Dispersions and adjusted frequencies in corpora , 2008 .

[41]  Robert C. Moore On Log-Likelihood-Ratios and the Significance of Rare Events , 2004, EMNLP.

[42]  Magali Paquot,et al.  EAP vocabulary in native and learner writing: from extraction to analysis: a phraseology-oriented approach , 2007 .

[43]  D. C. Howell Statistical Methods for Psychology , 1987 .

[44]  Malcolm Farrow,et al.  Use of the Chi-Squared Test to Examine Vocabulary Differences in English Language Corpora Representing Seven Different Countries , 2007, Lit. Linguistic Comput..

[45]  Olga Mudraya,et al.  Engineering English: A lexical frequency instructional model , 2006 .

[46]  Paul Baker Querying Keywords , 2004 .

[47]  Magali Paquot Towards a productively-oriented academic word list , 2007 .

[48]  D. Y. Lee Defining Core Vocabulary and Tracking Its Distribution across Spoken and Written Genres , 2001 .