Generic method for detecting focus time of documents

Statistical approach for estimating the focus time of text documents. Classification framework for categorizing documents as temporal or atemporal. Bi-temporal document representation using document focus time and creation time.

Time is an important aspect of text documents. While some documents are atemporal, many have strong temporal characteristics and contain content related to time. Such documents can be mapped to their corresponding time periods. In this paper, we propose estimating the focus time of documents, defined as the time period to which a document's content refers, which we treat as a dimension complementary to the document's creation time. We propose several estimators of focus time that use statistical knowledge from external resources such as news article collections. The advantage of our approach is that document focus time can be estimated even for documents that contain no temporal expressions or only a few of them. We evaluate the effectiveness of our methods on diverse datasets of documents about historical events related to five countries. Our approach achieves an average error of less than 21 years on collections of Wikipedia pages, extracts from history-related books, and web pages, over a total time frame of 113 years. We also demonstrate an example classification method to distinguish temporal from atemporal documents.
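To make the idea of estimating focus time from external statistics concrete, the following is a minimal sketch and not one of the estimators proposed in the paper. It assumes a hypothetical reference corpus of (year, text) pairs, scores each candidate year by how strongly the document's words are associated with it, and computes the average error in years used for evaluation. The function names, the whitespace tokenization, and the scoring rule are all illustrative assumptions.

```python
from collections import Counter, defaultdict


def build_word_year_counts(dated_texts):
    """Count, for each word, how many reference texts of each year contain it.

    dated_texts: iterable of (year, text) pairs (a hypothetical stand-in for
    an external collection such as dated news articles).
    """
    counts = defaultdict(Counter)
    for year, text in dated_texts:
        for word in set(text.lower().split()):
            counts[word][year] += 1
    return counts


def estimate_focus_year(text, counts):
    """Return the year most associated with the words in text, or None.

    Assumed scoring rule: each known word contributes its normalized
    distribution over years; the year with the highest total wins.
    No temporal expressions in the text are required.
    """
    scores = Counter()
    for word in set(text.lower().split()):
        year_counts = counts.get(word)
        if not year_counts:
            continue  # word never seen in the reference corpus
        total = sum(year_counts.values())
        for year, c in year_counts.items():
            scores[year] += c / total
    if not scores:
        return None
    # Break ties deterministically in favor of the earlier year.
    return max(scores, key=lambda y: (scores[y], -y))


def mean_absolute_error_years(predicted, actual):
    """Average absolute difference in years, skipping missing predictions."""
    pairs = [(p, a) for p, a in zip(predicted, actual) if p is not None]
    if not pairs:
        raise ValueError("no predictions to evaluate")
    return sum(abs(p - a) for p, a in pairs) / len(pairs)


if __name__ == "__main__":
    reference = [
        (1945, "war ended surrender"),
        (1945, "surrender treaty"),
        (1969, "moon landing apollo"),
        (1969, "apollo astronauts moon"),
    ]
    counts = build_word_year_counts(reference)
    print(estimate_focus_year("the apollo crew walked on the moon", counts))
```

In this toy setting the document contains no explicit date, yet the words "apollo" and "moon" occur only in 1969 reference texts, so that year receives the highest score. A realistic system would need stop-word handling, smoothing, and a much larger reference collection.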
