Assessing the Significance of Data Mining Results on Graphs with Feature Vectors

Assessing the significance of data mining results is an important step in the knowledge discovery process. Results that appear interesting at first glance can often be explained by already known characteristics of the data. Randomization is an established technique for significance testing, and methods for assessing mining results on vector data or on network data have been proposed. In many applications, however, both sources are given simultaneously. Since these sources are rarely independent of each other and are instead highly correlated, naively applying existing randomization methods to each source separately is questionable. In this work, we present a method to assess the significance of mining results on graphs with binary feature vectors. We propose a novel null model that preserves correlation information between both sources. Our randomization uses adaptive Metropolis sampling and interweaves attribute randomization and graph randomization steps. In thorough experiments, we demonstrate the application of our technique. Our results indicate that while using both sources simultaneously is beneficial, one source of information is often dominant in determining the mining results.
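To make the idea of randomization-based significance testing concrete, the following is a minimal sketch that interleaves degree-preserving edge swaps on the graph with margin-preserving swaps on the binary feature matrix, and then computes an empirical p-value. It is not the null model proposed here: it omits both the preservation of correlation between the two sources and the adaptive Metropolis sampling, so it corresponds to the naive baseline that the abstract cautions against. All function names, the 50/50 choice between step types, and the example statistic are assumptions made for illustration only.

```python
import random

def edge_swap(edges, rng):
    """Attempt one degree-preserving swap (a,b),(c,d) -> (a,d),(c,b).
    `edges` is a set of (u, v) tuples with u < v; it is modified in place."""
    e1, e2 = rng.sample(sorted(edges), 2)
    a, b = e1
    c, d = e2
    if rng.random() < 0.5:          # pick one of the two possible rewirings
        c, d = d, c
    if len({a, b, c, d}) < 4:       # would create a self-loop
        return
    n1 = (min(a, d), max(a, d))
    n2 = (min(c, b), max(c, b))
    if n1 in edges or n2 in edges:  # would create a multi-edge
        return
    edges -= {e1, e2}
    edges |= {n1, n2}

def attribute_swap(X, rng):
    """Attempt one swap on the binary matrix X (rows = nodes, columns =
    features) that keeps every row sum and column sum unchanged."""
    i, j = rng.sample(range(len(X)), 2)
    k, l = rng.sample(range(len(X[0])), 2)
    if X[i][k] == 1 and X[j][l] == 1 and X[i][l] == 0 and X[j][k] == 0:
        X[i][k] = X[j][l] = 0
        X[i][l] = X[j][k] = 1

def randomize(edges, X, steps, rng):
    """Return randomized copies of the graph and the feature matrix,
    interleaving the two step types (assumed 50/50 for illustration)."""
    edges = set(edges)
    X = [row[:] for row in X]
    for _ in range(steps):
        if rng.random() < 0.5:
            edge_swap(edges, rng)
        else:
            attribute_swap(X, rng)
    return edges, X

def shared_feature_edges(edges, X):
    """Hypothetical example statistic: number of edges whose endpoints
    share at least one feature."""
    return sum(any(p and q for p, q in zip(X[u], X[v])) for u, v in edges)

def empirical_p_value(statistic, edges, X, n_samples=99, steps=200, seed=0):
    """Fraction of randomized samples whose statistic is at least as large
    as the observed one, with the usual +1 correction."""
    rng = random.Random(seed)
    observed = statistic(edges, X)
    hits = sum(
        statistic(*randomize(edges, X, steps, rng)) >= observed
        for _ in range(n_samples)
    )
    return (hits + 1) / (n_samples + 1)
```

A small p-value from `empirical_p_value` would suggest that the observed statistic is not explained by node degrees and feature margins alone. Because the two sources are randomized independently in this sketch, any correlation between them is destroyed, which is precisely the limitation that motivates a correlation-preserving null model.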
