Running head: TEXTUAL COHERENCE USING LATENT SEMANTIC ANALYSIS

The Measurement of Textual Coherence with Latent Semantic Analysis

Latent Semantic Analysis (LSA) is used as a technique for measuring the coherence of texts. By comparing the vectors for two adjoining segments of text in a high-dimensional semantic space, the method provides a characterization of the degree of semantic relatedness between the segments. We illustrate the approach for predicting coherence by re-analyzing sets of texts from two studies that manipulated the coherence of texts and assessed readers' comprehension. The results indicate that the method is able to predict the effect of text coherence on comprehension and is more effective than simple term-term overlap measures. In this manner, LSA can be applied as an automated method that produces coherence predictions similar to propositional modeling. We describe additional studies investigating the application of LSA to analyzing discourse structure and examine the potential of LSA as a psychological model of coherence effects in text comprehension.

In order to comprehend a text, a reader must create a well-connected representation of the information in it. This connected representation is based on linking related pieces of textual information that occur throughout the text. The linking of information is a process of determining and maintaining coherence. Because coherence is a central issue in text comprehension, a large number of studies have investigated the process readers use to maintain coherence and have sought to model the readers' representation of the textual information as well as of their previous knowledge (e.g., Lorch & O'Brien, 1995). There are many aspects of a discourse that contribute to coherence, including coreference, causal relationships, connectives, and signals. For example, Kintsch and van Dijk (Kintsch, 1988; Kintsch & van Dijk, 1978) have emphasized the effect of coreference in coherence through propositional modeling of texts.
While coreference captures one aspect of coherence, it is highly correlated with other coherence factors such as causal relationships found in the text (Fletcher, Chrysler, van den Broek, Deaton, & Bloom, 1995; Trabasso, Secco, & van den Broek, 1984). Although a propositional model of a text can predict readers' comprehension, a problem with the approach is that in-depth propositional analysis is time consuming and requires a considerable amount of training. Semi-automatic methods of propositional coding (e.g., Turner, 1987) still require a large amount of effort. This degree of effort limits the size of the text that can be analyzed. Thus, most texts analyzed and used in reading comprehension experiments have been small, typically from 50 to 500 words, and almost all are under 1000 words. Automated methods such as readability measures (e.g., Flesch, 1948; Klare, 1963) provide another characterization of the text; however, they do not correlate well with comprehension measures (Britton & Gulgoz, 1991; Kintsch & Vipond, 1979). Thus, while the coherence of a text can be measured, doing so often involves considerable effort.

In this study, we use Latent Semantic Analysis (LSA) to determine the coherence of texts. A more complete description of the method and approach to using LSA may be found in Deerwester, Dumais, Furnas, Landauer, and Harshman (1990), Landauer and Dumais (1997), as well as in the preceding article by Landauer, Foltz, and Laham (this issue). LSA provides a fully automatic method for comparing units of textual information to each other in order to determine their semantic relatedness. These units of text are compared to each other using a derived measure of their similarity of meaning. This measure is based on a powerful mathematical analysis of direct and indirect relations among words and passages in a large training corpus.
Semantic relatedness so measured should correspond to a measure of coherence, since it captures the extent to which two text units are discussing semantically related information. Unlike methods that rely on counting literal word overlap between units of text, LSA's comparisons are based on a derived semantic relatedness measure that reflects semantic similarity among synonyms, antonyms, hyponyms, compounds, and other words that tend to be used in similar contexts. In this way, it can reflect coherence due to automatic inferences made by readers as well as to literal surface coreference. In addition, since LSA is automatic, there are no constraints on the size of the text analyzed. This permits analyses of much larger texts to examine aspects of their discourse structure.

In order for LSA to be considered an appropriate approach for modeling text coherence, we first establish how well LSA captures elements of coherence that are similar to those captured by modeling methods such as propositional models. A re-analysis of two studies that examined the role of coherence in readers' comprehension is described. This re-analysis of the texts produces automatic predictions of the coherence of texts, which are then compared to measures of the readers' comprehension. We next describe the application of the method to investigating other features of the discourse structure of texts. Finally, we illustrate how the approach applies both as a tool for text researchers and as a theoretical model of text coherence.

General approach for using LSA to measure coherence

The primary method for using LSA to make coherence predictions is to compare some unit of text to an adjoining unit of text in order to determine the degree to which the two are semantically related. These units could be sentences, paragraphs, or even individual words or whole books. This analysis can then be performed for all pairs of adjoining text units in order to characterize the overall coherence of the text.
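
The pairwise comparison of adjoining units can be sketched in a few lines of Python. This is a minimal illustration rather than the procedure used in the studies: it assumes each text unit has already been mapped to a vector, it uses the cosine as the relatedness measure, and it summarizes overall coherence as the mean over adjoining pairs. The function names and the toy vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for units with no known terms
    return dot / (norm_u * norm_v)

def adjacent_coherence(unit_vectors):
    """Return the cosine for every adjoining pair of units and their mean."""
    pairs = [cosine(a, b) for a, b in zip(unit_vectors, unit_vectors[1:])]
    mean = sum(pairs) / len(pairs) if pairs else float("nan")
    return pairs, mean

# Hypothetical 3-dimensional vectors for four sentences (made-up values).
sentences = [
    [1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
pairs, mean = adjacent_coherence(sentences)
print(pairs, mean)
```

A low cosine at one position in `pairs` marks a possible coherence break between those two units, while the mean gives a single summary value for the whole text.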
Coherence predictions have typically been performed at a propositional level, in which a set of propositions all contained within working memory are compared or connected to each other (e.g., Kintsch, 1988, in press). For LSA coherence analyses, using sentences as the basic unit of text appears to be an appropriate corresponding level that can be easily parsed by automated methods. Sentences serve as a good level in that they represent a small set of textual information (e.g., typically 3-7 propositions) and thus would be approximately consistent with the amount of information that is held in short-term memory.

As discussed in the preceding article by Landauer et al. (this issue), the power of computing semantic relatedness with LSA comes from analyzing a large number of text examples. Thus, for computing the coherence of a target text, it may first be necessary to have another set of texts that contain a large proportion of the terms used in the target text and in which those terms occur in many contexts. One approach is to use a large number of encyclopedia articles on topics similar to that of the target text. A singular value decomposition (SVD) is then performed on the term-by-article matrix, thereby generating a high-dimensional semantic space that contains most of the terms used in the target text. Individual terms, as well as larger text units such as sentences, can be represented as vectors in this space. Each text unit is represented as the weighted average of the vectors of the terms it contains. Typically the weighting is by the log-entropy transform of each term (see Landauer et al., this issue). This weighting helps account for both the term's importance in the particular unit and the degree to which the term carries information in the domain of discourse in general. The semantic relatedness of two text units can then be compared by determining the cosine between the vectors for the two units.
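
The weighting step can be made concrete with a small Python sketch. It is one plausible reading of the description above, not the exact implementation: it assumes the common log-entropy form, in which a term's local weight is log(1 + tf) and its global weight is 1 + sum_j(p_j log p_j) / log(n), where p_j is the share of the term's occurrences falling in document j and n is the number of documents (n must be greater than 1). The term vectors, which would normally come from the SVD, are replaced here by made-up two-dimensional values, and all names are hypothetical.

```python
import math

def log_entropy_weights(counts):
    """Global entropy weight per term; counts[term] lists its frequency in each document."""
    weights = {}
    for term, row in counts.items():
        total = sum(row)
        entropy = 0.0
        for tf in row:
            if tf > 0:
                p = tf / total
                entropy += p * math.log(p)
        # 1.0 for a term confined to one document, 0.0 for a term spread evenly.
        weights[term] = 1.0 + entropy / math.log(len(row))
    return weights

def unit_vector(tokens, term_vectors, global_weights):
    """Weighted average of the vectors of the known terms in one text unit."""
    tf = {}
    for token in tokens:
        if token in term_vectors:  # terms missing from the space are skipped
            tf[token] = tf.get(token, 0) + 1
    dims = len(next(iter(term_vectors.values())))
    total, weight_sum = [0.0] * dims, 0.0
    for term, freq in tf.items():
        w = math.log(1 + freq) * global_weights[term]
        weight_sum += w
        for d in range(dims):
            total[d] += w * term_vectors[term][d]
    return [x / weight_sum for x in total] if weight_sum else total

# Toy training counts over four documents (hypothetical).
counts = {"heart": [2, 0, 0, 0], "blood": [1, 1, 0, 0], "the": [1, 1, 1, 1]}
# Stand-ins for term vectors that would come from the SVD (hypothetical).
term_vectors = {"heart": [1.0, 0.0], "blood": [0.0, 1.0], "the": [1.0, 1.0]}

weights = log_entropy_weights(counts)
vec = unit_vector(["the", "heart", "pumps", "blood"], term_vectors, weights)
print(weights, vec)
```

In this toy case the evenly distributed term "the" receives a global weight of about 0 and so contributes almost nothing to the sentence vector, while "heart", which is concentrated in one document, dominates it.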
Thus, to find the coherence between the first and second sentence of a text, the cosine between the vectors for the two sentences would be determined. For instance, two sentences that use exactly the same terms with the same frequencies will have a cosine of 1, while two sentences that use no terms that are semantically related will tend to have cosines near 0 or below. At intermediate levels, sentences containing terms of related meaning, even if none are the same terms or roots, will have more moderate cosines. (It is even possible, although in practice very rare, that two sentences with no words of obvious similarity will have similar overall meanings as indicated by similar LSA vectors in the high-dimensional semantic space.)

Coherence and text comprehension

This paper illustrates an approach complementary to propositional modeling for determining coherence: using LSA and comparing the predicted coherence to measures of the readers' comprehension. For these analyses, the texts and comprehension measures are taken from two previous studies, by Britton and Gulgoz (1991) and McNamara et al. (1996). In the first study, the text coherence was manipulated primarily by varying the amount of sentence-to-sentence repetition of particular important content words through an analysis of propositional overlap. Simulating its results with LSA demonstrates the degree to which coherence is carried, or at least reflected, in the continuity of lexical semantics, and shows that LSA correctly captures these effects. However, for these texts, a simpler literal word overlap measure, absent any explicit propositional or LSA analysis, also predicts comprehension very well. The second set of texts, those from McNamara et al. (1996), manipulates coherence in much subtler ways, often by substituting words and phrases of related meaning but containing different lexical items to provide the conceptual bridges between one sentence and the next.
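
The contrast between literal word overlap and relatedness in a semantic space can be illustrated with a toy Python example. It is only a sketch under stated assumptions: the literal measure is taken to be the cosine between raw term-count vectors, which is not necessarily the overlap measure used in the analyses, and the two-dimensional term vectors are invented so that "doctor" and "physician" lie close together, as they might in a trained space. All names and values are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine between two equal-length vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0

def count_vector(tokens, vocabulary):
    """Raw term counts over a fixed vocabulary (a literal overlap representation)."""
    return [tokens.count(term) for term in vocabulary]

def mean_vector(tokens, term_vectors):
    """Unweighted average of the vectors of the known terms in a sentence."""
    known = [term_vectors[t] for t in tokens if t in term_vectors]
    return [sum(column) / len(known) for column in zip(*known)]

sent_a = ["the", "doctor", "arrived"]
sent_b = ["a", "physician", "came"]
vocabulary = sorted(set(sent_a) | set(sent_b))

# Literal measure: no shared words, so the cosine is 0.
literal = cosine(count_vector(sent_a, vocabulary), count_vector(sent_b, vocabulary))
# Same terms with the same frequencies: the cosine is 1 (up to rounding).
identical = cosine(count_vector(sent_a, vocabulary), count_vector(sent_a, vocabulary))

# Made-up term vectors standing in for a trained semantic space.
term_vectors = {
    "doctor": [0.9, 0.1], "physician": [0.8, 0.2],
    "arrived": [0.2, 0.7], "came": [0.1, 0.8],
}
semantic = cosine(mean_vector(sent_a, term_vectors), mean_vector(sent_b, term_vectors))
print(literal, identical, semantic)
```

With these invented vectors the two sentences share no words, so the literal measure is 0, yet their cosine in the toy semantic space is high. This is the kind of difference that texts manipulated through related but lexically distinct wording are meant to expose.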
These materials provide a much more rigorous and interesting test of the LSA technique by requiring it to detect underlying meaning similarities in the absence of literal word repetition. The success of this simulation, and its superiority to d

[1]  R. Flesch.  A new readability yardstick. , 1948, The Journal of applied psychology.

[2]  George R. Klare,et al.  The measurement of readability , 1963 .

[3]  Walter Kintsch,et al.  Reading rate and retention as a function of the number of propositions in the base structure of sentences , 1973 .

[4]  Michael Halliday,et al.  Cohesion in English , 1976 .

[5]  Walter Kintsch,et al.  Toward a model of text comprehension and production. , 1978 .

[6]  W. Kintsch,et al.  Reading comprehension and readability in educational practice and psychological theory , 1979 .

[7]  Walter Kintsch,et al.  Readability and recall of short prose passages: A theoretical analysis. , 1980 .

[8]  T. Trabasso Causal Cohesion and Story Coherence. , 1982 .

[9]  W. Kintsch,et al.  Strategies of discourse comprehension , 1983 .

[10]  Michael McGill,et al.  Introduction to Modern Information Retrieval , 1983 .

[11]  Candace L. Sidner,et al.  Attention, Intentions, and the Structure of Discourse , 1986, CL.

[12]  W. Kintsch The role of knowledge in discourse comprehension: a construction-integration model. , 1988, Psychological review.

[13]  Richard A. Harshman,et al.  Indexing by Latent Semantic Analysis , 1990, J. Am. Soc. Inf. Sci..

[14]  B. K. Britton,et al.  Using Kintsch's computational model to improve instructional text: Effects of repairing inference calls on recall and cognitive structures. , 1991 .

[15]  Graeme Hirst,et al.  Lexical Cohesion Computed by Thesaural relations as an indicator of the structure of text , 1991, CL.

[16]  Behavior research methods, instruments, & computers , 1991 .

[17]  Walter Kintsch,et al.  A cognitive architecture for comprehension. , 1992 .

[18]  Christian Plaunt,et al.  Subtopic structuring for full-length document access , 1993, SIGIR.

[19]  E. J. O'Brien,et al.  Sources of coherence in reading , 1995 .

[20]  W. Kintsch,et al.  Are Good Texts Always Better? Interactions of Text Coherence, Background Knowledge, and Levels of Understanding in Learning From Text , 1996 .

[21]  Peter W. Foltz,et al.  Latent semantic analysis for text-based research , 1996 .

[22]  Peter W. Foltz,et al.  Reasoning from Multiple Texts: An Automatic Analysis of Readers' Situation Models , 1996 .

[23]  T. Landauer,et al.  A Solution to Plato's Problem: The Latent Semantic Analysis Theory of Acquisition, Induction, and Representation of Knowledge. , 1997 .

[24]  Walter Kintsch,et al.  Comprehension: A Paradigm for Cognition , 1998 .

[25]  Peter W. Foltz,et al.  Learning from text: Matching readers and texts by latent semantic analysis , 1998 .