Refining causality: who copied from whom?

Inferring the causal networks behind observed data is an active area of research with wide applicability to areas such as epidemiology, microbiology and social science. In particular, recent research has focused on identifying how information propagates through the Internet. This research has so far used only the temporal features of observations, and while reasonable results have been achieved, further information is often available. In this paper we show that additional features of the observed data can be used very effectively to improve an existing method. Our particular example is inferring the underlying network of text reuse on the Internet, although the general approach is applicable to other inference methods and information sources. We develop a method to identify how a piece of text evolves as it moves through an underlying network, and show how substring information can be used to narrow down where in the evolutionary process a particular observation at a node lies. This in turn narrows down the number of ways in which the node could have acquired the infection, that is, the reused text. Text reuse is detected using a suffix tree, which is also used to identify the substring relations between chunks of reused text. We then use a modification of the NetCover method to infer the underlying network. Experimental results on both synthetic and real-life data show that using more information than just timing leads to greater accuracy in the inferred networks.
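To make the idea concrete, the following is a minimal sketch of how a substring constraint can shrink the set of candidate edges before a greedy set-cover step. It is an illustration under stated assumptions, not the method of the paper: the function names `candidate_edges` and `greedy_cover`, the toy observations, and the rule that a copy must be a substring of its source are all hypothetical, and Python's built-in `in` test stands in for the suffix tree.

```python
from collections import defaultdict

# Toy observations (hypothetical data): (node, time, text seen at that node).
observations = [
    ("A", 1, "the quick brown fox jumps"),
    ("B", 2, "lazy dogs sleep all day"),
    ("C", 3, "quick brown fox"),
    ("D", 4, "dogs sleep"),
]

def candidate_edges(observations, use_substrings=True):
    """Map each candidate edge (u, v) to the observation indices it could explain.

    Timing-only rule: u must have been observed strictly before v.
    Assumed substring rule: the text at v must also be a substring of the
    text at u. A plain `in` test replaces the suffix tree for brevity.
    """
    covers = defaultdict(set)
    for i, (v, t_v, text_v) in enumerate(observations):
        for u, t_u, text_u in observations:
            if u == v or t_u >= t_v:
                continue  # a source must be a different, earlier node
            if use_substrings and text_v not in text_u:
                continue  # assumed rule: a copy is contained in its source
            covers[(u, v)].add(i)
    return covers

def greedy_cover(covers):
    """Greedy set cover: repeatedly pick the edge explaining the most
    still-unexplained observations; ties go to the smallest edge."""
    uncovered = set().union(*covers.values()) if covers else set()
    chosen = []
    while uncovered:
        edge = max(sorted(covers), key=lambda e: len(covers[e] & uncovered))
        chosen.append(edge)
        uncovered -= covers[edge]
    return chosen

timing_only = candidate_edges(observations, use_substrings=False)
with_text = candidate_edges(observations, use_substrings=True)
network = greedy_cover(with_text)
```

On this toy input the timing-only rule admits six candidate edges, while the substring rule leaves only A to C and B to D, so the cover step has far fewer spurious explanations to choose from. Observations with no admissible source, such as the one at B here, are simply left unexplained.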
