A Link Analysis Extension of Correspondence Analysis for Mining Relational Databases

This work introduces a link analysis procedure for discovering relationships in a relational database or a graph, generalizing both simple and multiple correspondence analysis. It is based on a random walk model through the database, which defines a Markov chain having as many states as there are elements in the database. Suppose we are interested in analyzing the relationships between some elements (or records) contained in two different tables of the relational database. To this end, in a first step, a much smaller reduced Markov chain, containing only the elements of interest and preserving the main characteristics of the initial chain, is extracted by stochastic complementation. In a second step, this reduced chain is analyzed by jointly projecting the elements of interest into the diffusion map subspace and visualizing the results. This two-step procedure reduces to simple correspondence analysis when only two tables are defined, and to multiple correspondence analysis when the database takes the form of a simple star schema. In addition, a kernel version of the diffusion map distance, generalizing the basic diffusion map distance to directed graphs, is introduced, and its links with spectral clustering are discussed. Several data sets are analyzed using the proposed methodology, showing the usefulness of the technique for extracting relationships in relational databases or graphs.
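
The two-step procedure can be illustrated with a minimal sketch. It is not the authors' implementation: the toy graph, the function names `stochastic_complement` and `diffusion_map`, and the parameters `dim` and `t` are assumptions made for illustration. The first step uses the standard stochastic complement of a partitioned transition matrix, S = P11 + P12 (I - P22)^(-1) P21. The second step uses the basic eigenvector-based diffusion map on an undirected toy graph, not the kernel version for directed graphs mentioned in the abstract.

```python
import numpy as np

def stochastic_complement(P, keep):
    """Reduce a row-stochastic matrix P to the states listed in `keep`.

    With P partitioned into kept states (1) and remaining states (2),
    the stochastic complement is S = P11 + P12 (I - P22)^{-1} P21.
    """
    P = np.asarray(P, dtype=float)
    keep = np.asarray(keep)
    rest = np.setdiff1d(np.arange(P.shape[0]), keep)
    P11 = P[np.ix_(keep, keep)]
    P12 = P[np.ix_(keep, rest)]
    P21 = P[np.ix_(rest, keep)]
    P22 = P[np.ix_(rest, rest)]
    # Solve (I - P22) X = P21 rather than forming the inverse explicitly.
    X = np.linalg.solve(np.eye(len(rest)) - P22, P21)
    return P11 + P12 @ X

def diffusion_map(S, dim=2, t=1):
    """Basic diffusion map coordinates (illustrative, unnormalized).

    Assumes S has real eigenvalues, which holds for the reversible chain
    of an undirected graph. Eigenvector scaling is left as NumPy returns
    it, so each axis is defined only up to a scale factor.
    """
    vals, vecs = np.linalg.eig(S)
    vals, vecs = np.real(vals), np.real(vecs)
    order = np.argsort(-np.abs(vals))
    idx = order[1:dim + 1]  # skip the trivial eigenvalue 1
    return vecs[:, idx] * vals[idx] ** t

# Hypothetical toy database: tables A = {a0, a1} and B = {b0, b1},
# linked only through an intermediate table C = {c0, c1}.
# Node order: 0=a0, 1=a1, 2=b0, 3=b1, 4=c0, 5=c1.
edges = [(0, 4), (1, 5), (2, 4), (3, 4), (3, 5)]
A = np.zeros((6, 6))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

# Random walk on the graph: normalize each row of the adjacency matrix.
P = A / A.sum(axis=1, keepdims=True)

# Step 1: keep only the elements of tables A and B.
S = stochastic_complement(P, keep=[0, 1, 2, 3])

# Step 2: joint 2-D coordinates for the elements of interest.
coords = diffusion_map(S, dim=2, t=1)
```

In this sketch `S` is again a row-stochastic matrix, now over the four elements of tables A and B only, and each row of `coords` gives the plotting position of one element.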
