Predicting Document Creation Times in News Citation Networks

For the temporal analysis of news articles or the extraction of temporal expressions from such documents, accurate document creation times are indispensable. In many document collections, creation times are available as time stamps or HTML metadata, but in others this data is inaccurate or incomplete. In digitally published online news articles especially, publication times are often missing from the article or inaccurate because the content was (partially) updated at a later time. In this paper, we investigate the prediction of document creation times for articles in citation networks of digitally published news articles, which provide a network structure of knowledge flows between individual articles in addition to the temporal expressions the articles contain. We explore the evolution of such networks to motivate the extraction of suitable features, which we then use to predict document creation times, framed as a regression task. Based on our evaluation of several established machine learning regressors on a large network of English news articles, we show that the combination of temporal and local structural features allows for the estimation of document creation times from the network.
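To make the idea of combining temporal and local structural features concrete, the following is a minimal sketch and not the feature set or the regressors evaluated in the paper. The toy network, the day offsets, and the names `CITES`, `KNOWN_DCT`, `MENTIONED`, `extract_features`, and `naive_estimate` are all assumptions introduced for illustration. The midpoint heuristic stands in for a trained regressor and assumes that an article is created after the articles it cites and before the articles that cite it, which later content updates can violate.

```python
from statistics import median

# Hypothetical toy network: each article maps to the articles it cites.
CITES = {
    "a": [],
    "b": ["a"],
    "c": ["a", "b"],
    "d": ["c"],
}

# Known creation times as day offsets (illustrative values); "c" is undated.
KNOWN_DCT = {"a": 0, "b": 3, "d": 10}

# Day offsets of dates mentioned in an article's text (illustrative values).
MENTIONED = {"c": [1, 4, 5]}


def extract_features(article):
    """Collect local structural and temporal features for one article."""
    cited = CITES.get(article, [])
    citing = [src for src, targets in CITES.items() if article in targets]
    cited_times = [KNOWN_DCT[n] for n in cited if n in KNOWN_DCT]
    citing_times = [KNOWN_DCT[n] for n in citing if n in KNOWN_DCT]
    mentions = MENTIONED.get(article, [])
    return {
        # Local structural features.
        "out_degree": len(cited),
        "in_degree": len(citing),
        # Temporal features taken from dated neighbours.
        "latest_cited": max(cited_times) if cited_times else None,
        "earliest_citing": min(citing_times) if citing_times else None,
        # Temporal feature taken from the article's own temporal expressions.
        "median_mention": median(mentions) if mentions else None,
    }


def naive_estimate(feats):
    """Stand-in for a trained regressor: midpoint of the neighbour bounds."""
    lower, upper = feats["latest_cited"], feats["earliest_citing"]
    if lower is not None and upper is not None:
        return (lower + upper) / 2
    # Fall back to whichever single temporal feature is available.
    candidates = [v for v in (lower, upper, feats["median_mention"]) if v is not None]
    return candidates[0] if candidates else None


features_c = extract_features("c")
estimate_c = naive_estimate(features_c)
```

In a regression setting, feature dictionaries like `features_c` would be computed for articles with known creation times and used as training input, with the known creation time as the target value.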
