Paper Information - Event Extraction from Heterogeneous News Sources


With the proliferation of news articles from thousands of different sources now available on the Web, summarization of such information is becoming increasingly important. Our research focuses on merging descriptions of news events from multiple sources, to provide a concise description that combines the information from each source. Specifically, we describe and evaluate methods for grouping sentences in news articles that refer to the same event. The key idea is to cluster the sentences, using two novel distance metrics. The first distance metric exploits regularities in the sequential structure of events within a document. The second metric uses a TFIDF-like weighting scheme, enhanced to capture word frequencies within events even though the events themselves are not known a priori. Typical news articles contain sentences that do not describe specific events. We use machine learning methods to differentiate between sentences that describe one or more events, and those that do not. We then remove non-event sentences before initiating the clustering process. We demonstrate that this approach achieves significant improvements in overall clustering performance.
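The sentence-clustering step described in the abstract can be illustrated with a minimal sketch. This is not the paper's method: the abstract does not define its two distance metrics, so the sketch assumes a plain TF-IDF cosine distance and average-link agglomerative clustering as stand-ins. The function names (`tfidf_vectors`, `cosine_distance`, `cluster_sentences`) and the `threshold` value are hypothetical, chosen only for illustration.

```python
import math
import re
from collections import Counter


def tokenize(sentence):
    """Lowercase a sentence and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


def tfidf_vectors(sentences):
    """Build one sparse TF-IDF vector (a dict) per sentence.

    Assumption: each sentence is treated as a 'document', and a smoothed
    IDF is used. The paper's event-aware weighting is NOT reproduced here.
    """
    counts = [Counter(tokenize(s)) for s in sentences]
    n = len(counts)
    df = Counter()
    for c in counts:
        df.update(c.keys())  # document frequency: one count per sentence
    idf = {w: math.log((1 + n) / (1 + df[w])) + 1.0 for w in df}
    return [{w: tf * idf[w] for w, tf in c.items()} for c in counts]


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two sparse vectors."""
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def cluster_sentences(sentences, threshold=0.7):
    """Average-link agglomerative clustering over sentence indices.

    Merging stops once the closest pair of clusters is farther apart
    than `threshold` (a hypothetical cut-off, not taken from the paper).
    """
    vectors = tfidf_vectors(sentences)
    clusters = [[i] for i in range(len(sentences))]

    def linkage(a, b):
        # Mean pairwise distance between the members of two clusters.
        total = sum(cosine_distance(vectors[i], vectors[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        dist, i, j = min(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if dist > threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

Calling `cluster_sentences` on a list of sentences returns groups of sentence indices, each group being a candidate event. In the pipeline the abstract describes, non-event sentences would be filtered out by a classifier before this step; that filter is omitted from the sketch.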
