Extracting and Displaying Temporal and Geospatial Entities from Articles on Historical Events

This paper describes a system that extracts temporal and geospatial entities from text and displays them. The system performs two tasks. The first identifies all events in a document and then uses a classifier to select the important ones. The second identifies the named entities associated with the document, in particular geospatial named entities. We disambiguate the geospatial named entities and geocode them to determine the correct coordinates for each place name, a process often called grounding. Ambiguity is resolved using sentence and article context. Finally, we present the user with the key events in a document and their associated people, places, and organizations on a timeline and a map. For testing, we use Wikipedia articles about historical events, such as those describing wars, battles, and invasions. We focus on extracting major events from these articles, although our ideas and tools can readily be applied to articles from other sources, such as news articles. We use several existing tools: Evita, Google Maps, publicly available implementations of Support Vector Machines, Hidden Markov Models, and Conditional Random Fields, and the MIT SIMILE Timeline.
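As a rough illustration of grounding with article context, the sketch below picks, for each ambiguous place name, the candidate that lies closest to the unambiguous places mentioned in the same article. This is only one plausible heuristic and is not claimed to be the method used in the system. The gazetteer contents, the `ground` function, and the distance-based scoring are all hypothetical stand-ins; the system itself relies on Google Maps for geocoding.

```python
import math

# Hypothetical toy gazetteer: place name -> candidate (label, lat, lon) entries.
# A real system would query a geocoding service instead.
GAZETTEER = {
    "Paris": [
        ("Paris, France", 48.8566, 2.3522),
        ("Paris, Texas, USA", 33.6609, -95.5555),
    ],
    "Verdun": [("Verdun, France", 49.1599, 5.3844)],
    "Reims": [("Reims, France", 49.2583, 4.0317)],
}


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points."""
    radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))


def ground(place_names, gazetteer=GAZETTEER):
    """Map each known place name in an article to one gazetteer candidate.

    Assumed heuristic: names with a single candidate act as anchors, and
    each ambiguous name takes the candidate with the smallest total
    distance to those anchors. Names missing from the gazetteer are skipped.
    """
    anchors = [gazetteer[name][0] for name in place_names
               if len(gazetteer.get(name, [])) == 1]
    grounded = {}
    for name in place_names:
        candidates = gazetteer.get(name, [])
        if not candidates:
            continue
        if len(candidates) == 1 or not anchors:
            # Nothing to disambiguate, or no context to disambiguate with.
            grounded[name] = candidates[0]
            continue
        grounded[name] = min(
            candidates,
            key=lambda c: sum(haversine_km(c[1], c[2], a[1], a[2])
                              for a in anchors),
        )
    return grounded


if __name__ == "__main__":
    for name, (label, lat, lon) in ground(["Paris", "Verdun", "Reims"]).items():
        print(f"{name} -> {label} ({lat}, {lon})")
```

For an article that mentions Verdun and Reims, this heuristic would resolve "Paris" to the French city rather than the one in Texas. Sentence-level context could be added by weighting anchors from the same sentence more heavily.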
