A Hypergraph-based Method for Discovering Semantically Associated Itemsets

In this paper, we address the data mining problem of finding semantically associated itemsets, i.e., sets of items connected via indirect links. We propose a novel method for discovering semantically associated itemsets based on a hypergraph representation of the database. We describe two similarity measures for computing the strength of association between items. Specifically, we introduce the average commute time similarity, $\mathbf{s_{CT}}$, based on the random walk model on the hypergraph, and the inner-product similarity, $\mathbf{s_{L+}}$, based on the Moore-Penrose pseudoinverse of the hypergraph Laplacian matrix. Given the semantically associated 2-itemsets generated by these measures, we design a hypergraph expansion method with two search strategies, namely the clique search and the connected component search, to generate $k$-itemsets ($k>2$). Experiments on three datasets ranging from low to high dimensionality show that the proposed method is capable of capturing semantically associated itemsets. The semantically associated itemsets discovered in our experiments promise to provide valuable insights into the interrelationships between medical concepts and other domain-specific concepts.
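
As a rough illustration of the two quantities named above, the following minimal Python sketch builds a hypergraph Laplacian from a small item-by-transaction incidence matrix, takes its Moore-Penrose pseudoinverse, and derives average commute times from it. Several things here are assumptions, not details taken from this abstract: the unnormalized Laplacian $L = D_v - H W D_e^{-1} H^T$, unit hyperedge weights, the function name `hypergraph_similarities`, and the toy matrix `H_demo`. The entries of the pseudoinverse play the role of an inner-product similarity, and a commute-time similarity could be obtained, for example, by inverting the off-diagonal commute times.

```python
import numpy as np


def hypergraph_similarities(H, w=None):
    """Return (L_plus, commute) for an incidence matrix H (items x transactions).

    Assumptions (not from the source): unnormalized Laplacian
    L = D_v - H W D_e^{-1} H^T, and every transaction contains at least one item.
    """
    H = np.asarray(H, dtype=float)
    n_items, n_edges = H.shape
    # Hyperedge weights default to 1 (an assumption for this sketch).
    w = np.ones(n_edges) if w is None else np.asarray(w, dtype=float)

    d_e = H.sum(axis=0)   # hyperedge degree: number of items per transaction
    d_v = H @ w           # vertex degree: total weight of transactions containing the item

    # Weighted adjacency induced by the random walk on the hypergraph.
    A = H @ np.diag(w / d_e) @ H.T
    L = np.diag(d_v) - A

    # Moore-Penrose pseudoinverse; its entries act as inner products between items.
    L_plus = np.linalg.pinv(L)

    # Average commute time: vol * (l+_ii + l+_jj - 2 * l+_ij).
    diag = np.diag(L_plus)
    vol = d_v.sum()
    commute = vol * (diag[:, None] + diag[None, :] - 2.0 * L_plus)
    return L_plus, commute


# Toy database (hypothetical): transactions {0,1}, {1,2}, {2,3}.
# Items 0 and 3 never co-occur and are linked only indirectly via items 1 and 2.
H_demo = np.array([
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 1],
    [0, 0, 1],
])

L_plus, commute = hypergraph_similarities(H_demo)
print(np.round(commute, 3))
```

In this toy database, items 0 and 3 share no transaction, yet they still receive a finite commute time through the chain of intermediate items, which is the kind of indirect link the method is meant to surface.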
