This paper introduces a new statistical approach to partitioning text automatically into coherent segments. Our approach enlists both short-range and long-range language models to help it sniff out likely sites of topic changes in text. To aid its search, the system consults a set of simple lexical hints it has learned to associate with the presence of boundaries through inspection of a large corpus of annotated data. We also propose a new probabilistically motivated error metric for use by the natural language processing and information retrieval communities, intended to supersede precision and recall for appraising segmentation algorithms. Qualitative assessment of our algorithm, as well as evaluation using this new metric, demonstrates the effectiveness of our approach in two very different domains: Wall Street Journal articles and the TDT Corpus, a collection of newswire articles and broadcast news transcripts.

* Research supported in part by NSF grant IRI9314969, DARPA AASERT award DAAH04-95-1-0475, and the ATR Interpreting Telecommunications Research Laboratories.

1 Introduction

The task we address in this paper might seem on the face of it rather elementary: identify where one region of text ends and another begins. This work was motivated by the observations that such a seemingly simple problem can actually prove quite difficult to automate, and that a tool for partitioning a stream of undifferentiated text (or multimedia) into coherent regions would be of great benefit to a number of existing applications.

The task itself is ill-defined: what exactly is meant by a "region" of text? We confront this issue by adopting an empirical definition of segment. At our disposal is a collection of online data (38 million words of Wall Street Journal archives and another 150 million words from selected news broadcasts) annotated with the boundaries between regions: articles or news reports, respectively. Given this input, the task of constructing a segmenter may be cast as a problem in machine learning: glean from the data a set of hints about where boundaries occur, and use these hints to inform a decision on where to place breaks in unsegmented data.

A general-purpose tool for partitioning expository text or multimedia data into coherent regions would have a number of immediate practical uses. In fact, this research was inspired by a problem in information retrieval: given a large unpartitioned collection of expository text and a user's query, return a collection of coherent segments matching the query. Lacking a segmenting tool, an IR application may be able to locate positions in its database which are strong matches with the user's query, but be unable to determine how much of the surrounding data to provide to the user. This can manifest itself in quite unfortunate ways. For example, a video-on-demand application (such as the one described in (Christel et al., 1995)) responding to a query about a recent news event may provide the user with a news clip related to the event, followed or preceded by part of an unrelated story or even a commercial.

Document summarization is another fertile area for an automatic segmenter. Summarization tools often work by breaking the input into "topics" and then summarizing each topic independently; a segmentation tool has obvious applications to the first of these tasks. The output of a segmenter could also serve as input to various language-modeling tools. For instance, one could envision segmenting a corpus, classifying the segments by topic, and then constructing topic-dependent language models from the generated classes.
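To make this envisioned pipeline concrete, the following is a minimal sketch in Python. Everything in it is illustrative rather than part of the system described in this paper: the segmenter is a stand-in that splits a token stream at given boundary indices, the topic classifier is a toy that labels each segment by its most frequent word, and the per-topic models are simple unigram distributions. The point is only the data flow from segmentation, through classification, to class-conditioned model estimation.

```python
# Hypothetical sketch of the pipeline described above. All function names
# and the toy classifier are illustrative, not the paper's actual system.
from collections import Counter, defaultdict

def segment(tokens, boundaries):
    """Split a token stream at known boundary indices
    (a stand-in for an automatic segmenter)."""
    starts = [0] + list(boundaries)
    ends = list(boundaries) + [len(tokens)]
    return [tokens[s:e] for s, e in zip(starts, ends)]

def assign_topic(seg):
    """Toy topic classifier: label a segment by its most frequent word."""
    return Counter(seg).most_common(1)[0][0]

def topic_unigram_models(segments):
    """Pool segments by topic label and estimate one unigram
    distribution per topic class."""
    pooled = defaultdict(list)
    for seg in segments:
        pooled[assign_topic(seg)].extend(seg)
    models = {}
    for topic, tokens in pooled.items():
        counts = Counter(tokens)
        total = sum(counts.values())
        models[topic] = {w: c / total for w, c in counts.items()}
    return models

# Tiny worked example: two "topics" split at token index 6.
corpus = ("profits rose profits fell shares rose "
          "rain fell rain rose clouds rain").split()
for topic, dist in topic_unigram_models(segment(corpus, [6])).items():
    print(topic, dist)
```

In a real deployment, the stand-in splitter would be replaced by the statistical segmenter developed in this paper and the toy labeler by a genuine document classifier; the resulting topic-dependent models could then be mixed or selected at decoding time.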
The paper will proceed as follows. In Section 2 we
References

[1] Ronald Rosenfeld et al. Adaptive Language Modeling Using the Maximum Entropy Principle. HLT, 1993.
[2] John D. Lafferty et al. A Model of Lexical Attraction and Repulsion. ACL, 1997.
[3] Slava M. Katz. Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Trans. Acoust. Speech Signal Process., 1987.
[4] Michael G. Christel, Takeo Kanade, et al. Informedia Digital Video Library. CACM, 1995.
[5] Marti A. Hearst. Multi-Paragraph Segmentation of Expository Text. ACL, 1994.
[6] Radford M. Neal. Connectionist Learning of Belief Networks. Artif. Intell., 1992.
[7] Bernard Mérialdo et al. A Dynamic Language Model for Speech Recognition. HLT, 1991.
[8] Hideki Kozima. Text Segmentation Based on Similarity between Words. ACL, 1993.
[9] G. Youmans. A New Tool for Discourse Analysis: The Vocabulary-Management Profile. 1991.
[10] Rebecca J. Passonneau et al. Combining Multiple Knowledge Sources for Discourse Segmentation. ACL, 1995.
[11] Hideki Kozima et al. Segmenting Narrative Text into Coherent Scenes. 1994.
[12] Renato De Mori et al. A Cache-Based Natural Language Model for Speech Recognition. IEEE Trans. Pattern Anal. Mach. Intell., 1990.
[13] Adam L. Berger et al. A Maximum Entropy Approach to Natural Language Processing. Computational Linguistics, 1996.
[14] John D. Lafferty et al. Inducing Features of Random Fields. IEEE Trans. Pattern Anal. Mach. Intell., 1995.