Efficiently spotting the starting points of an epidemic in a large graph

Given a snapshot of a large graph in which an infection has been spreading for some time, can we identify the nodes from which the infection started to spread? In other words, can we reliably tell who the culprits are? In this paper, we answer this question affirmatively and give an efficient method, called NetSleuth, for the well-known susceptible-infected (SI) virus propagation model. Essentially, we are after the set of seed nodes that best explains the given snapshot. We propose to employ the minimum description length (MDL) principle to identify the best set of seed nodes and virus propagation ripple, as the one by which we can most succinctly describe the infected graph. We give a highly efficient algorithm to identify likely sets of seed nodes given a snapshot. Then, given these seed nodes, we show that we can optimize the virus propagation ripple in a principled way by maximizing likelihood. With all three combined, NetSleuth can automatically identify the correct number of seed nodes, as well as which nodes are the culprits. Experiments show that our method detects seed nodes with high accuracy and automatically identifies their correct number. Moreover, NetSleuth scales linearly in the number of nodes of the graph.
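To make the setting concrete, the following is a minimal sketch of how an infected snapshot could arise under a discrete-time susceptible-infected (SI) model. It is not NetSleuth itself and is not taken from the paper: the function name `simulate_si`, the adjacency-dictionary representation, and the parameters `beta` (per-step infection probability along an edge) and `steps` are assumptions made for illustration only.

```python
import random


def simulate_si(adj, seeds, beta, steps, rng=None):
    """Simulate a discrete-time SI process and return the infected snapshot.

    adj   : dict mapping each node to a list of its neighbors (assumed format)
    seeds : iterable of initially infected nodes
    beta  : probability that an infected node infects a given susceptible
            neighbor in one time step (assumed parameterization)
    steps : number of time steps to run
    """
    rng = rng or random.Random(0)  # fixed seed so runs are reproducible
    infected = set(seeds)
    for _ in range(steps):
        newly_infected = set()
        for u in infected:
            for v in adj[u]:
                # In the SI model, infected nodes never recover, so only
                # susceptible neighbors can change state.
                if v not in infected and rng.random() < beta:
                    newly_infected.add(v)
        infected |= newly_infected
    return infected


# Hypothetical toy graph: a path 0 - 1 - 2 - 3 - 4 with node 2 as the seed.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
snapshot = simulate_si(path, seeds=[2], beta=1.0, steps=1)
```

The culprit-identification problem runs in the opposite direction: given only `snapshot` and the graph, recover the seed set (here, node 2) without knowing how many seeds there were.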
