PSM-Flow: Probabilistic Subgraph Mining for Discovering Reusable Fragments in Workflows

Scientific workflows define the computational processes needed to carry out scientific experiments. Existing workflow repositories contain hundreds of scientific workflows, in which scientists can find materials and knowledge that facilitate the design of workflows for related experiments. As these repositories grow, identifying reusable fragments has become increasingly important. In this paper, we present PSM-Flow, a probabilistic subgraph mining algorithm that discovers commonly occurring fragments in a workflow corpus using a modified version of the Latent Dirichlet Allocation algorithm. The proposed model encodes the geodesic distance between workflow steps in order to model fragments implicitly. PSM-Flow captures variations of frequent fragments while keeping its space complexity polynomially bounded, since it requires no candidate generation. We applied PSM-Flow to three real-world scientific workflow datasets containing more than 750 workflows for neuroimaging analysis. Our results show that PSM-Flow outperforms three state-of-the-art frequent subgraph mining techniques. We also discuss potential future improvements to the proposed method.
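
To make the notion of geodesic distance between workflow steps concrete, the following minimal sketch computes hop counts from one step to every other step of a toy workflow using breadth-first search. It is an illustration only, not the PSM-Flow model: the step names and the `geodesic_distances` helper are hypothetical, and treating workflow edges as undirected is an assumption made here for simplicity, since the abstract does not specify how the distance is defined.

```python
from collections import deque


def geodesic_distances(edges, source):
    """Return the shortest hop count from `source` to every reachable step.

    Assumption: edges are treated as undirected, so distance ignores the
    direction of data flow between workflow steps.
    """
    neighbours = {}
    for u, v in edges:
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)

    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in neighbours.get(node, ()):
            if nxt not in dist:  # first visit in BFS is the shortest path
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


# Hypothetical toy workflow; step names are illustrative only.
workflow_edges = [
    ("align", "skull_strip"),
    ("skull_strip", "segment"),
    ("segment", "measure"),
    ("align", "register"),
]
distances = geodesic_distances(workflow_edges, "align")
```

Steps that lie within a small geodesic distance of each other are natural candidates for belonging to the same fragment, which is the intuition behind encoding this distance in the model.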
