Exploring a corpus of scientific texts using data mining

We report on a project investigating the linguistic properties of English scientific texts on the basis of a corpus of journal articles from nine academic disciplines. The goal of the project is to gain insight into registers emerging at the boundaries between computer science and other disciplines (e.g., bioinformatics, computational linguistics, computational engineering). The questions we focus on in this paper are (a) how characteristic the corpus is of the meta-register it represents, and (b) how different or similar the subcorpora are in terms of the more specific registers they instantiate. We analyze the corpus using several data-mining techniques, including feature ranking, clustering, and classification, to see how the subcorpora group in terms of selected linguistic features. The results show that our corpus is well distinguished in terms of the meta-register of scientific writing; we also find distinctive features for the subcorpora that serve as indicators of register diversification. Apart from presenting the results of our analyses, we also reflect on and assess the use of data mining for the tasks of corpus exploration and analysis.
