Size matters: a quantitative approach to corpus representativeness

Corpus Linguistics (CL) has not yet come of age. Whether we consider it a full-fledged linguistic discipline (Tognini-Bonelli 2000: 1) or a set of analytical techniques that can be applied to any discipline (McEnery et al. 2006: 7), CL is still striving to solve thorny, central issues such as the optimum size, balance and representativeness of corpora (whether of the language as a whole or of some subset of it). Corpus-driven and corpus-based studies rely on the quality and representativeness of each corpus as the foundation for producing valid results. This entails deciding on valid external and internal criteria for corpus design and compilation. A basic tenet is that the representativeness of a corpus determines the kinds of research questions that can be addressed and the generalizability of the results obtained (cf. Biber et al. 1988: 246). Unfortunately, faith and beliefs do not seem to ensure quality. In this paper we attempt to deal with these key questions. Firstly, we will give a brief description of the R&D projects which

[1]  Miriam Seghiri Compilación de un corpus trilingüe de seguros turísticos (español-inglés-italiano): aspectos de evaluación, catalogación, diseño y representatividad , 2006 .

[2]  Douglas Biber,et al.  Representativeness in corpus design , 1993 .

[3]  J. González Aplicaciones al análisis automático del contenido provenientes de la teoría matemática de la información , 2002 .

[4]  Federico Zanettin,et al.  CEXI: Designing an English Italian Translational Corpus , 2002 .

[5]  Pieter de Haan On the exploration of corpus data by means of problem-oriented tagging: Postmodifying clauses in the English noun phrase , 1991 .

[6]  Michael Rundell,et al.  The corpus revolution , 1992 .

[7]  Pascual Cantos,et al.  On the Corpus Size Needed for Compiling a Comprehensive Computational Lexicon by Automatic Lexical Acquisition , 2002, Comput. Humanit..

[8]  Mark Lauer,et al.  Corpus Statistics Meet the Noun Compound: Some Empirical Results , 1995, ACL.

[9]  G. Leech Corpora and theories of linguistic performance , 1992 .

[10]  Aquilino Sánchez,et al.  Predictability of word forms (types) and lemmas in linguistic corpora. A Case Study Based on the Analysis of the CUMBRE Corpus: an 8-million-word Corpus of contemporary Spanish , 1997 .

[11]  Gloria Corpas Pastor,et al.  Determinación del Umbral de Representatividad de un Corpus mediante el Algoritmo N-Cor , 2007, Proces. del Leng. Natural.

[12]  H. S. Heaps,et al.  Information retrieval, computational and theoretical aspects , 1978 .

[13]  Paul Baker,et al.  Using Corpora in Discourse Analysis , 2006 .

[14]  J. L. Myers,et al.  Regression analyses of repeated measures data in cognitive research , 1990, Journal of experimental psychology. Learning, memory, and cognition.

[15]  B. Boguraev Book Reviews: Looking Up: An Account of the COBUILD Project in Lexical Computing , 1990, CL.

[16]  Geoffrey Williams In search of representativity in specialised corpora: Categorisation through collocation , 2002 .

[17]  S. Laviosa How Comparable Can 'Comparable Corpora' Be? , 1997 .

[18]  Elena Tognini-Bonelli,et al.  Corpus Linguistics at Work , 2002, Computational Linguistics.

[19]  Randolph Quirk,et al.  On corpus principles and design , 1992 .

[20]  Nicoletta Calzolari,et al.  Current issues in computational linguistics : in honour of Don Walker , 1994 .

[21]  Douglas Biber,et al.  Dimensions of Register Variation: A Cross-Linguistic Comparison , 1995 .

[22]  Stig Johansson,et al.  Manual of information to accompany the Lancaster-Oslo/Bergen Corpus of British English, for use with digital computers , 1978 .

[23]  Kenneth Ward Church,et al.  Introduction to the Special Issue on Computational Linguistics Using Large Corpora , 1993, Comput. Linguistics.

[24]  Anabel Borja El texto jurídico inglés y su traducción al español , 2000 .

[25]  Irene Díaz,et al.  Algoritmo de filtrado multi-término para la obtención de relaciones jerárquicas en la construcción automática de un tesauro , 1999 .

[26]  C. P. Hernández Terminología basada en corpus: principios teóricos y metodológicos , 2004 .

[27]  Pieter de Haan The optimum corpus sample size , 1992 .

[28]  Douglas Biber,et al.  Variation across speech and writing: Methodology , 1988 .

[29]  C. Fillmore "Corpus linguistics" or "Computer-aided armchair linguistics" , 2008 .

[30]  Pascual Cantos Gómez,et al.  El ritmo incremental de palabras nuevas en los repertorios de textos: Estudio experimental y comparativo basado en dos corpus lingüísticos equivalentes de cuatro millones de palabras, de las lenguas inglesa y española y en cinco autores de ambas lenguas , 1997 .

[31]  Ronald Carter,et al.  Trust the Text: Language, Corpus and Discourse , 2004 .

[32]  Mohsen Ghadessy,et al.  Small corpus studies and ELT : theory and practice , 2001 .

[33]  John Sinclair Corpus typology : a framework for classification , 1995 .

[34]  Dan-Hee Yang,et al.  An Algorithm for Predicting the Relationship between Lemmas and Corpus Size , 2000 .

[35]  Sue Ellen Wright,et al.  Handbook of terminology management , 2001 .

[36]  Jennifer Pearson,et al.  Working with Specialized Language: A Practical Guide to Using Corpora , 2002 .

[37]  Khurshid Ahmad,et al.  Corpus Linguistics and Terminology Extraction , 2001 .

[38]  Specialized Corpora for Translators : A Quantitative Method to Determine Representativeness , 2022 .

[39]  Mark Lauer How much is enough?: Data requirements for statistical NLP , 1995, ArXiv.

[40]  Tony McEnery,et al.  Corpus-Based Language Studies: An Advanced Resource Book , 2006 .

[41]  Susan Conrad,et al.  Corpus Linguistics: Investigating Language Structure and Use , 1998 .

[42]  Josse de Kock Gramática y corpus: los pronombres demostrativos , 1997 .

[43]  Gloria Corpas Pastor Traducir con corpus: de la teoría a la práctica , 2002 .

[44]  John Sinclair,et al.  Corpus, Concordance, Collocation , 1991 .

[45]  Chantal Pérez Hernández,et al.  Explotación de los córpora textuales informatizados para la creación de bases de datos terminológicas basadas en el conocimiento , 2002 .

[46]  Nicholas Ostler,et al.  Corpus Design Criteria , 1992 .

[47]  John Sinclair,et al.  Collins COBUILD English Language Dictionary , 1987 .

[48]  D. Biber Methodological Issues Regarding Corpus-based Analyses of Linguistic Variation , 1990 .

[49]  Mark Lauer Conserving Fuel in Statistical Language Learning: Predicting Data Requirements , 1995, ArXiv.