Predicting Relevance of Event Extraction for the End User

We present work on estimating how relevant the results of an Event Extraction system are to the end user’s needs. Our aim is to develop user-oriented measures of the utility of extracted events, i.e., how useful the factual information found in a document is to the end user. Traditional criteria for evaluating the performance of Information Extraction (IE) focus on the correctness of the extracted information, e.g., in terms of recall, precision, and F-measure. We focus instead on a subjective criterion for the quality of the extracted information: the utility of the results to the end user. To measure utility, we use methods from text mining and linguistic analysis to identify features that are good predictors of the relevance of an event or a document. We introduce discourse and lexical features, and build classifiers that learn from users’ ratings of the relevance of the extraction results. We report on experiments in two real-world event extraction domains: corporate activities reported in business news, and health threats reported in news about infectious epidemics.
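For reference, the correctness-oriented criteria mentioned above (precision, recall, F-measure) can be sketched as follows. This is a minimal illustration, not the evaluation code used in this work: the function name `precision_recall_f1` and the (disease, location) event tuples are hypothetical, and events are assumed to match only when they are exactly equal.

```python
def precision_recall_f1(extracted, gold):
    """Score a set of extracted items against a gold-standard set.

    Assumption for this sketch: an extracted event counts as correct
    only if it is exactly equal to a gold-standard event.
    """
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    # Guard against empty sets to avoid division by zero.
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        # Balanced F-measure: harmonic mean of precision and recall.
        f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical (disease, location) events, invented for illustration only.
gold = {
    ("cholera", "Haiti"),
    ("measles", "Kenya"),
    ("ebola", "Uganda"),
    ("dengue", "Brazil"),
}
extracted = {
    ("cholera", "Haiti"),
    ("measles", "Kenya"),
    ("flu", "Peru"),
}

p, r, f = precision_recall_f1(extracted, gold)
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

Scores of this kind describe only whether the extracted facts are correct. They say nothing about whether a correctly extracted event is useful to a particular user, which is the gap the utility measures in this work are meant to address.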
