Event identification within news topics

With the vast amount of information arriving each day, automatic techniques are needed for analyzing and handling it. Topic Detection and Tracking (TDT) addresses this problem by organizing news stories into topics, where each topic is viewed as a flat collection of stories. However, a news topic is not only a flat collection of stories but also a set of events. This yields a three-layer hierarchy (topic → event → story) that makes it easier for readers to follow new developments in the news. Recognizing the events within a topic is therefore important. Unfortunately, the similarity between two stories that belong to different events in the same topic is usually high, because many words occur in both stories, and these common words make events within a topic easy to confuse. To address this problem, we present a novel approach to event identification. First, we remove topic-specific stopwords from each story; then we select some named entities as part of the features, because they are highly discriminative for identifying events. A further issue is feature weighting: in previous work, the weights on different features were determined empirically, whereas we propose a new method to calculate them. Experiments are conducted on a Linguistic Data Consortium dataset. The results show that our scheme for event identification improves significantly over previous methods.
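
To illustrate why topic-specific stopwords matter, the following is a minimal sketch and not the method evaluated in this paper. It assumes that a term counts as a topic-specific stopword when it occurs in at least a fixed fraction of the stories in the topic; the threshold `df_ratio`, the helper names, and the toy stories are all hypothetical. Named-entity features and the proposed feature weighting are not shown.

```python
import math
from collections import Counter


def topic_stopwords(stories, df_ratio=0.8):
    """Return terms occurring in at least df_ratio of the topic's stories.

    Assumption: document frequency within the topic is used as the
    stopword criterion; the abstract does not specify the actual rule.
    """
    doc_freq = Counter()
    for tokens in stories:
        doc_freq.update(set(tokens))  # count each term once per story
    threshold = df_ratio * len(stories)
    return {term for term, count in doc_freq.items() if count >= threshold}


def strip_stopwords(tokens, stopwords):
    """Drop every token that is in the stopword set."""
    return [t for t in tokens if t not in stopwords]


def cosine(a, b):
    """Cosine similarity between two token lists using raw term counts."""
    va, vb = Counter(a), Counter(b)
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = math.sqrt(sum(v * v for v in va.values())) * math.sqrt(
        sum(v * v for v in vb.values())
    )
    return dot / norm if norm else 0.0


# Toy topic with two events: stories 0-1 (rescue) and stories 2-3 (aid).
stories = [
    "earthquake city rescue teams arrive".split(),
    "earthquake city rescue survivors found".split(),
    "earthquake city government aid pledged".split(),
    "earthquake city aid donations rise".split(),
]

stop = topic_stopwords(stories)
filtered = [strip_stopwords(s, stop) for s in stories]

# Similarity between stories from different events, before and after filtering.
cross_before = cosine(stories[0], stories[2])
cross_after = cosine(filtered[0], filtered[2])

# Similarity between stories from the same event after filtering.
same_after = cosine(filtered[0], filtered[1])
```

In this toy topic, "earthquake" and "city" occur in every story, so they inflate the similarity between stories from different events. Once they are removed, the cross-event similarity falls to zero, while the two rescue stories still share a term and remain similar.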