Exploiting Structure for Event Discovery Using the MDI Algorithm

Identifying events in unstructured text is difficult, largely because a single event can be expressed across several sentences. In this paper, we investigate clustering methods for grouping the text spans in a news article that refer to the same event. The key idea is to cluster the sentences using a novel distance metric that exploits regularities in the sequential structure of events within a document. Compared with a simple bag-of-words baseline, this approach yields a statistically significant increase in performance.
