Graph and network data: mining the temporal dimension

In recent years, there have been many studies on the analysis of network and graph data. A wide range of problems, such as studying the global and local properties of a graph, finding interesting structures, modeling particular characteristics, and assessing the properties of particular networks such as the Web or co-authorship networks, have attracted growing attention from the scientific community, which has sought efficient and powerful techniques for solving them. For example, with the aim of finding interesting and frequent substructures in graphs, algorithms such as AGM, FSG, gSpan, Gaston, FFSM, ADI-Mine, HSIGRAM and VSIGRAM have been presented one after another to improve the scalability of subgraph mining. However, only in the last few years has attention moved to a particular aspect of graphs and networks: the temporal dimension. Thanks in part to the wider availability of online social network services, the amount of data that allows for the analysis of the dynamics of complex networks has grown very quickly over the last five years. This kind of data contains rich information about what happens to a network over time, and enables analysts to model and discover interesting properties related to the temporal dimension, properties that are neither meaningful nor possible to study in the static setting. The temporal dimension can play a double role for a network. First, the underlying structure, namely the graph, can evolve over time, with new users joining a community, new connections being created among users, the properties of a particular group of people changing, and so on. Second, given an established network, users may perform actions over time, leading to flows of information circulating along the connections, sequences of tasks performed by a sequence of users, the spread of influence across the network, and so on.
Despite the clear richness of the above setting, current graph mining techniques are somewhat too generic, and they do not explicitly take time into account at any of their stages. To overcome this problem, in this thesis we study the current graph mining algorithms, we investigate the possibility of pushing constraints into the computation so that the temporal dimension can be analyzed efficiently at the mining stage, and we develop new techniques that can help in this kind of analysis. To demonstrate the effectiveness of our approach, we apply three tools to real data: a pre-existing graph miner, a modified version of it specialized to deal with the temporal dimension, and another pre-existing analysis tool, namely the Temporally Annotated Sequences framework. We use them to show how the above setting can be handled, with particular focus on problems such as mining information propagation in a network, mining graph evolution rules, and mining the temporal dimension of process logs to derive the actual workflow diagram of a process. Our results justify the need for this approach, and show that specialized techniques help in modeling and analyzing temporal graph and network data.
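
The two roles of time described above can be illustrated with a minimal sketch. It is not the method developed in the thesis: the toy data, the function names `snapshot` and `propagations`, and the simple "a neighbour acted earlier" rule are all assumptions made only for illustration. Edges carry the time at which they were created (the evolving structure), and actions carry the time at which a user performed them (activity over the network).

```python
# Hypothetical toy data, invented for illustration only.
# Directed edges as (source, target, creation_time).
edges = [("a", "b", 1), ("b", "c", 2), ("a", "c", 4)]
# Actions as (user, action, time).
actions = [("a", "post", 5), ("b", "post", 6), ("c", "post", 9), ("a", "like", 3)]


def snapshot(edges, t):
    """Role 1 (evolving structure): edges that exist at time t.

    Assumes edges are never deleted once created.
    """
    return {(u, v) for u, v, created in edges if created <= t}


def propagations(edges, actions, action):
    """Role 2 (activity over the network): candidate propagation steps.

    Returns (u, v, delay) for every edge u -> v such that both users
    performed `action`, u did so strictly before v, and the edge already
    existed when v acted. Assumes each user performs `action` at most once.
    """
    times = {user: ts for user, act, ts in actions if act == action}
    steps = []
    for u, v, created in edges:
        if u in times and v in times:
            if times[u] < times[v] and created <= times[v]:
                steps.append((u, v, times[v] - times[u]))
    return steps


if __name__ == "__main__":
    print(snapshot(edges, 2))
    print(propagations(edges, actions, "post"))
```

Keeping the timestamp on each edge and each action, rather than collapsing everything into a single static graph, is what makes questions about delays and ordering expressible at all.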
