Link Analysis of Higher-Order Paths in Supervised Learning Datasets

Recent concerns about security and terrorism have increased the focus on techniques that discover links and relations in data. Several efforts that employ “data mining” techniques have contributed to this field, but few focus on discovering patterns in sets of higher-order links, which can reveal hidden or indirect relationships in data. In this work, we focus on the discovery and analysis of higher-order path patterns in a supervised learning dataset. We first analyze higher-order links in the leaf nodes of a decision tree and find evidence for distinguishing between nodes of different classes. Based on these results, we then turn to the training data used to build the tree. Our results indicate that classes of instances in labeled training data may be separable based on the characteristics of higher-order paths. This technique has potential applications in cybersecurity and cyberforensics, as well as in text mining and analytics.
