Statistics in Corpus Linguistics

The book gives step-by-step guidance through the process of statistical analysis and provides multiple examples of how statistical techniques can be used to analyse and visualise linguistic data. It also includes a selection of discussion questions and exercises which you can use to check your understanding. The book comes with a companion website, which provides additional materials (answers to exercises, datasets, advanced materials, teaching slides, etc.) and Lancaster Stats Tools online (http://corpora.lancs.ac.uk/stats), a free click-and-analyse statistical tool for easy calculation of the statistical measures discussed in the book.
