Event Extraction from Heterogeneous News Sources

With news articles from thousands of sources now available on the Web, summarizing this information is becoming increasingly important. Our research focuses on merging descriptions of news events from multiple sources to provide a concise description that combines the information from each source. Specifically, we describe and evaluate methods for grouping sentences in news articles that refer to the same event. The key idea is to cluster the sentences using two novel distance metrics. The first metric exploits regularities in the sequential structure of events within a document. The second uses a TF-IDF-like weighting scheme, enhanced to capture word frequencies within events even though the events themselves are not known a priori. Typical news articles also contain sentences that do not describe specific events. We use machine learning methods to differentiate between sentences that describe one or more events and those that do not, and we remove the non-event sentences before clustering. We demonstrate that this approach achieves significant improvements in overall clustering performance.
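As a rough illustration of the pipeline outlined above (drop non-event sentences, then cluster the rest with a distance that mixes lexical and sequential cues), the following Python sketch may help. It is not the method evaluated in this work. The plain TF-IDF cosine distance, the positional term, the single-link threshold clustering, and the parameters `alpha`, `window`, and `threshold` are all assumptions chosen for brevity, and `is_event` is a hypothetical caller-supplied predicate standing in for the learned event/non-event classifier.

```python
import math
import re
from collections import Counter


def tokenize(sentence):
    """Lowercase a sentence and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


def tfidf_vectors(sentences):
    """Build sparse TF-IDF vectors, treating each sentence as a document.

    Assumption: a smoothed IDF, log((1 + n) / (1 + df)) + 1, is used here
    purely for illustration.
    """
    counts = [Counter(tokenize(s)) for s in sentences]
    n = len(counts)
    df = Counter(term for c in counts for term in c)
    return [
        {t: tf * (math.log((1 + n) / (1 + df[t])) + 1.0) for t, tf in c.items()}
        for c in counts
    ]


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two sparse vectors (1.0 if either is empty)."""
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    return 1.0 - dot / (norm_u * norm_v)


def combined_distance(i, j, vecs, alpha=0.8, window=5):
    """Mix lexical distance with a crude sequential cue (assumed form).

    Sentences that are close together in the document receive a smaller
    positional distance; the gap is capped at `window` sentences.
    """
    lexical = cosine_distance(vecs[i], vecs[j])
    positional = min(abs(i - j), window) / window
    return alpha * lexical + (1.0 - alpha) * positional


def cluster_sentences(sentences, threshold=0.6, alpha=0.8, window=5):
    """Single-link clustering: join any pair whose distance is <= threshold."""
    vecs = tfidf_vectors(sentences)
    parent = list(range(len(sentences)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i in range(len(sentences)):
        for j in range(i + 1, len(sentences)):
            if combined_distance(i, j, vecs, alpha, window) <= threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(len(sentences)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


def extract_event_clusters(sentences, is_event, threshold=0.6):
    """Remove non-event sentences, then cluster what remains.

    `is_event` is a hypothetical predicate standing in for a trained classifier.
    """
    kept = [s for s in sentences if is_event(s)]
    return [[kept[i] for i in group] for group in cluster_sentences(kept, threshold)]
```

A caller would pass the sentences of one article in document order, together with whatever event classifier is available, for example `extract_event_clusters(sentences, is_event=my_classifier)`, where `my_classifier` is a placeholder name. Filtering first means that the positional term is computed over the retained sentences only, which is one of several reasonable choices.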