Matrix: A statistical method and software tool for linguistic analysis through corpus comparison

A thesis submitted to Lancaster University for the degree of Ph.D. in Computer Science. Paul Edward Rayson, B.Sc. September 2002.

This thesis reports the development of a new kind of method and tool (Matrix) for advancing the statistical analysis of electronic corpora of linguistic data. First, we describe the standard corpus linguistic methodology, which is hypothesis-driven. The standard research process model is ‘question – build – annotate – retrieve – interpret’; in other words, the research question (and the linguistic features of interest) is identified early in the study. In recent years corpora have been increasingly annotated with linguistic information. Our survey finds no available tools that are data-driven on annotated corpora, that is, tools that assist in finding candidate research questions. Matrix is such a tool. It allows the macroscopic analysis (the study of the characteristics of whole texts or varieties of language) to inform the microscopic level (focussing on the use of a particular linguistic feature) as to which linguistic features should be investigated further.

By integrating part-of-speech tagging and lexical semantic tagging in a profiling tool, the Matrix technique extends the keywords procedure to produce key grammatical categories and key concepts. It has been shown to be applicable in the comparison of UK 2001 general election manifestos of the Labour and Liberal Democratic parties, vocabulary studies in sociolinguistics, studies of language learners, information extraction and content analysis. So far, it has been tested only on restricted levels of annotation and only on English language data.
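The idea of extending the keywords procedure from words to grammatical categories and concepts can be illustrated with a minimal sketch. It assumes each token carries a word form, a part-of-speech tag and a semantic tag, and it ranks items by a log-likelihood ratio computed from their frequencies in two corpora. The abstract does not say which statistic Matrix uses, so the log-likelihood ratio here is an assumption, and the function names and the tag labels in the usage note are hypothetical.

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """Log-likelihood ratio for one item.

    a, b: frequency of the item in corpus 1 and corpus 2
    c, d: total number of tokens in corpus 1 and corpus 2
    """
    # Expected frequencies if the item were equally common in both corpora.
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll


def key_items(tokens1, tokens2, level):
    """Rank items at one annotation level by log-likelihood.

    tokens1, tokens2: lists of (word, pos_tag, semantic_tag) tuples
    level: 0 for words, 1 for part-of-speech tags, 2 for semantic tags
    """
    f1 = Counter(t[level] for t in tokens1)
    f2 = Counter(t[level] for t in tokens2)
    n1, n2 = sum(f1.values()), sum(f2.values())
    scored = [
        (item, log_likelihood(f1[item], f2[item], n1, n2))
        for item in set(f1) | set(f2)
    ]
    # Highest score first: the items whose relative frequencies differ most.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Calling `key_items(corpus_a, corpus_b, level=0)` gives a keyword-style ranking, while `level=1` and `level=2` apply the same comparison to part-of-speech tags and semantic tags. The score alone does not say in which corpus an item is overused; that requires comparing the relative frequencies as well.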
