Detecting Subject Boundaries Within Text: A Language Independent Statistical Approach

We describe here an algorithm for detecting subject boundaries within text based on a statistical lexical similarity measure. Hearst has already tackled this problem with good results (Hearst, 1994). One of her main assumptions is that a change in subject is accompanied by a change in vocabulary. Using this assumption, but introducing a new measure of word significance, we have been able to build a robust and reliable algorithm which exhibits improved accuracy without sacrificing language independence.

1 Introduction

Automatic detection of subject divisions within a text is considered to be a very difficult task even for humans, let alone machines. Yet such subject divisions are used in more complex text processing tasks such as text summarisation, so an automatic method for marking subject boundaries is highly desirable. Hearst (1994) addresses this problem by applying a statistical method for detecting subjects within text. She describes an algorithm for what she calls Text Tiling, a method for detecting subject boundaries within a text. The underlying assumption of this algorithm is that there is a high probability that words related to a certain subject will be repeated whenever that subject is mentioned. Another basic assumption is that when a new subject emerges the choice of vocabulary will change, and will stay consistent within the subject boundaries until the next change in subject. These basic notions of vocabulary consistency within subject boundaries lead to a method for dividing text based on calculating vocabulary similarity between two adjacent windows of text. Each potential subject boundary is identified and assigned a correspondence value based on the lexical similarity between two windows of text, one on either side of the potential boundary. The values for all potential boundaries are plotted on a graph, creating peaks and troughs.
The troughs represent changes in vocabulary use and therefore, according to the underlying assumption, a change in subject. A division mark is inserted where a significant local minimum is detected on the graph. Hearst measured approximately 80% success in detection of subject boundaries on some texts. We decided to adopt Hearst's underlying assumption that a change in subject will entail a change in vocabulary. Our aim was to make the algorithm as language independent and computationally expedient as possible, while also improving accuracy and reliability.
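
The window-comparison procedure described above can be illustrated with a minimal Python sketch. It is not Hearst's implementation, nor the word significance measure introduced in this paper. It only shows the general idea of scoring each candidate boundary by the lexical similarity of the two adjacent windows and marking significant troughs. The function names (`tokenize`, `cosine`, `gap_scores`, `boundaries`), the use of cosine similarity over raw word counts, the window and step sizes, and the rule that a trough is "significant" when it falls below the mean score minus half a standard deviation are all assumptions made for illustration.

```python
import math
import re
import statistics
from collections import Counter


def tokenize(text):
    """Lowercase the text and extract runs of ASCII letters (an assumed, crude tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(left, right):
    """Cosine similarity between two bags of words given as Counter objects."""
    dot = sum(count * right[word] for word, count in left.items())
    norm = math.sqrt(sum(c * c for c in left.values())) * math.sqrt(
        sum(c * c for c in right.values())
    )
    return dot / norm if norm else 0.0


def gap_scores(tokens, window=20, step=5):
    """Score each candidate boundary by the similarity of the windows on either side.

    Returns a list of (token_index, similarity) pairs.
    """
    scores = []
    for gap in range(window, len(tokens) - window + 1, step):
        left = Counter(tokens[gap - window:gap])
        right = Counter(tokens[gap:gap + window])
        scores.append((gap, cosine(left, right)))
    return scores


def boundaries(scores, depth=0.5):
    """Return token indices of strict local minima lying below mean - depth * stdev.

    The significance rule is an assumption of this sketch, not taken from the paper.
    """
    if len(scores) < 3:
        return []
    values = [score for _, score in scores]
    threshold = statistics.mean(values) - depth * statistics.pstdev(values)
    found = []
    for i in range(1, len(values) - 1):
        is_trough = values[i] < values[i - 1] and values[i] < values[i + 1]
        if is_trough and values[i] < threshold:
            found.append(scores[i][0])
    return found
```

Calling `boundaries(gap_scores(tokenize(text)))` on a document returns the token positions at which a division mark would be inserted under these assumptions. Because the sketch relies only on word repetition, it needs no language-specific resources beyond the tokenizer, which would have to be adapted for scripts other than unaccented Latin letters.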