T-Scan: a new tool for analyzing Dutch text

T-Scan is a new tool for analyzing Dutch text. It aims at extracting text features that are theoretically interesting, in that they relate to genre and text complexity, as well as practically interesting, in that they enable users and text producers to make text-specific diagnoses. T-Scan derives its features from tools such as Frog and Alpino, and resources such as SoNaR, SUBTLEX-NL and Referentie Bestand Nederlands. This paper offers a qualitative discussion of a number of T-Scan features, based on a minimal demonstration corpus of six texts, three of them scientific articles and three of them drawn from a women’s magazine. We discuss features concerning lexical complexity, sentence complexity, referential cohesion and lexical diversity, lexical semantics and personal style. For all these domains we examine the construct validity as well as the reliability of a number of important features. We conclude that T-Scan offers a number of promising lexical and syntactic features, while the interpretation of referential cohesion/lexical diversity features and personal style features is less clear. Further developing the application and analyzing authentic text need to go hand in hand.
