Using LSA to Automatically Identify Givenness and Newness of Noun Phrases in Written Discourse

Christian F. Hempelmann, David Dufty, Philip M. McCarthy, Arthur C. Graesser, Zhiqiang Cai, and Danielle S. McNamara
({chmplmnn, ddufty, pmmccrth, a-graesser, zcai, dsmcnamr}@memphis.edu)
Institute for Intelligent Systems/FedEx Institute of Technology
University of Memphis, Memphis, TN 38152

Abstract

Identifying given and new information within a text has long been addressed as a research issue. However, there has previously been no accurate computational method for assessing the degree to which constituents in a text contain given versus new information. This study develops a method for automatically categorizing noun phrases into one of three categories of givenness/newness, using the taxonomy of Prince (1981) as the gold standard. The central computational technique is span (Hu et al., 2003), a derivative of latent semantic analysis (LSA). We analyzed noun phrases from two expository and two narrative texts. Predictors of newness included span as well as pronoun status, determiners, and word overlap with previous noun phrases. Logistic regression showed that span was superior to LSA in categorizing noun phrases, producing an increase in accuracy from 74% to 80%.

Introduction

Successive constituents in text, such as sentences or noun phrases (NPs), vary in how much new versus given information they contain. This distinction is not binary. For example, it is unclear how to classify an idea that a reader would have inferred earlier in the text rather than encountered explicitly, as will be discussed later. The aim of this paper is to assess the extent to which givenness and newness can be computed algorithmically from features of the text. Automatic assessment of givenness is useful for a variety of NLP applications, including the assessment of student responses in automatic tutoring systems, paragraph recognition, discourse feature identification, and recall scoring. The present application was devised for implementation in Coh-Metrix (Graesser et al., 2004b), a text-processing tool that provides new methods of automatically assessing text cohesion, readability, and difficulty.

When considering the dimension of familiarity, text constituents can be classified into three partitions: given, partially given (based on various types of inferential availability), or not given (that is, new). To illustrate the basic distinctions of givenness, consider the following example.

(1) President Bush said on Friday he recognized that there were other solutions to bolster Social Security than his contentious proposal for personal retirement accounts, but they would be part of a broader overhaul of the country's largest entitlement program.

In this example, Social Security is new when it is first mentioned, while the country's largest entitlement program is coreferential with it. Thus, the constituent the country's largest entitlement program is given information, even though there are lexical differences that have to be bridged inferentially. Retirement accounts, on the other hand, is only inferentially available from Social Security; that is, it is neither fully new nor unexpected in view of the previous mention of Social Security. Thus, retirement accounts is neither given nor new but somewhere in between.

We propose that any word in a text is situated on a continuum between wholly given and wholly new. By extension, any phrase, clause, or sentence can be assessed, in whole or in part, for its degree of givenness. Our goal in this paper is thus to explore methods for automatically extracting these degrees of givenness for particular sections of text.

When developing an automatic system, it is more natural to view new information as that information which is not given, rather than vice versa. We therefore first compute how much given (old) information a constituent contains and then regard the remaining information as new. Consequently, any automated measure that captures how part of a text can be established as given by a reader is valuable, as it increases the amount of identified givenness. However, before discussing computational measures of givenness in more detail, the theoretical basis for the relevant concepts will be addressed in the next section.

Theoretical accounts of the given/new dimension

Halliday (1967) defines given information as "recoverable either anaphorically or situationally" from the preceding discourse and new information, conversely, as not recoverable. Chafe (1975, 1987) defines given information as "knowledge which the speaker assumes to be in the consciousness of the addressee" (1975: 30). In Chafe's initial binary framework of given and new, given information is previously activated, whereas new information is activated only by the current segment of text. Chafe then introduces a distinction between new, given, and a third category, 'quasi-given' (1977: 34). This third category is related to the inferential availability of information and has been a central concept in modern approaches. Clark and Haviland (1977) extend the distinction using Gricean maxims, proposing a
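The span measure named in the abstract rests on a geometric intuition: the given portion of a constituent's LSA vector is the part that lies within the subspace spanned by the vectors of the preceding text, and the residual is new. The sketch below illustrates that intuition under two assumptions: each constituent has already been mapped to an LSA vector, and the projection is computed via a QR decomposition. The function name and this particular formulation are ours for illustration, not the implementation of Hu et al. (2003).

```python
import numpy as np

def span_newness(np_vector, prior_vectors):
    """Estimate the newness of a noun phrase as the fraction of its
    LSA vector lying outside the span of the preceding text's vectors.
    Returns 0.0 for a wholly given NP, 1.0 for a wholly new one.
    A sketch of the span idea, not the published algorithm.
    """
    v = np.asarray(np_vector, dtype=float)
    P = np.asarray(prior_vectors, dtype=float)  # one row per prior constituent
    # Orthonormal basis for the span of the prior vectors.
    # (QR is adequate here; a rank-truncated SVD would be safer
    # when the prior vectors are nearly linearly dependent.)
    Q, _ = np.linalg.qr(P.T)
    given = Q @ (Q.T @ v)       # projection onto the prior span: given part
    new = v - given             # residual: information not recoverable from prior text
    return np.linalg.norm(new) / np.linalg.norm(v)
```

For instance, with prior vectors spanning the first two axes of a three-dimensional space, a vector along the third axis scores 1.0 (wholly new), a vector along the first axis scores 0.0 (wholly given), and a vector mixing the two falls in between, matching the continuum between given and new proposed above.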
References

[1] W. Chafe. Givenness, contrastiveness, definiteness, subjects, topics, and point of view. 1976.
[2] E. Prince. The ZPG letter: Subjects, definiteness, and information-status. 1992.
[3] D. Caplan. Clause boundaries and recognition latencies for words in sentences. 1972.
[4] J. Cassell et al. BEAT: The Behavior Expression Animation Toolkit. Life-Like Characters, 2001.
[5] F. R. Chang. Active memory processes in visual sentence comprehension: Clause effects and pronominal reference. Memory & Cognition, 1980.
[6] M. Potter et al. Clauses and the semantic representation of words. Memory & Cognition, 1985.
[7] G.-J. M. Kruijff et al. Discourse-level annotation for investigating information structure. ACL, 2004.
[8] M. Gernsbacher et al. Accessing sentence participants: The advantage of first mention. Journal of Memory and Language, 1988.
[9] O. Uryupina et al. High-precision identification of discourse new and unique noun phrases. ACL, 2003.
[10] R. Freedle. Discourse Production and Comprehension. 1978.
[11] R. Vieira et al. A corpus-based investigation of definite description use. Computational Linguistics, 1997.
[12] A. C. Graesser et al. Coh-Metrix: Analysis of text on cohesion and language. Behavior Research Methods, Instruments, & Computers, 2004.
[13] H. H. Mitchell et al. AutoTutor: A tutor with dialogue in natural language. Behavior Research Methods, Instruments, & Computers, 2004.
[14] K. Fraurud et al. Definiteness and the processing of noun phrases in natural discourse. Journal of Semantics, 1990.
[15] M. Strube et al. Never look back: An alternative to centering. COLING-ACL, 1998.
[16] R. S. Tomlin et al. Coherence and Grounding in Discourse: Outcome of a Symposium, Eugene, Oregon, June 1984. 1987.
[17] M. Steedman et al. Information structure and the syntax-phonology interface. Linguistic Inquiry, 2000.
[18] S. Nirenburg and V. Raskin. Ontological Semantics. 2004.
[19] W. Chafe. Cognitive constraints on information flow. 1984.
[20] E. F. Prince et al. Toward a taxonomy of given-new information. 1981.
[21] P. W. Foltz et al. The measurement of textual coherence with latent semantic analysis. 1998.
[22] W. Kintsch et al. Toward a model of text comprehension and production. 1978.
[23] M. Halliday. Notes on transitivity and theme in English, Part 2. 1967.
[24] M. Steedman et al. Discourse and information structure. Journal of Logic, Language and Information, 2003.
[25] P. W. Foltz et al. An introduction to latent semantic analysis. 1998.
[26] P. W. Foltz et al. The intelligent essay assessor: Applications to educational technology. 1999.
[27] R. Vieira et al. An empirically-based system for processing definite descriptions. Computational Linguistics, 2000.
[28] A. C. Graesser et al. A revised algorithm for latent semantic analysis. IJCAI, 2003.