Functional Node Detection on Linked Data

Networks, which characterize relationships among objects, are ubiquitous across domains. An important problem is detecting the nodes that serve a specific function in these networks. For example, is a user normal or anomalous in an email network? Does a protein play a key role in a protein-protein interaction network? In many applications, the available information about a network includes both node characteristics and network structure. Both types of information can contribute to the task of learning functional nodes, and we call the collection of node and link information linked data. However, existing methods use only a few subjectively selected topological features of the network structure to detect functional nodes, and thus fail to capture highly discriminative and meaningful patterns hidden in linked data. To address this problem, we present a novel Feature Integration based Functional Node Detection (FIND) algorithm. Specifically, FIND extracts the most discriminative information from both node characteristics and network structure in the form of a unified latent feature representation, guided by several labeled nodes. Experiments on two real-world data sets validate that the proposed method significantly outperforms the baselines on the detection of three different types of functional nodes.

1 Background and Motivation

During the formation and evolution of a network, nodes usually take on various roles or functionalities. Detecting the nodes that have a specific functionality is essential for understanding the corresponding patterns of the network. For instance, critical nodes in social networks are used for contagion analysis [7], bridging nodes in protein interaction networks stand for the key proteins connecting modules [4], and spammers in email networks need to be filtered out to keep the systems healthy [9].
Existing methods for this problem usually make strong assumptions about the relationship between specific topological properties and the types of functional nodes of interest. For example, nodes with high bridging centrality are categorized as bridging nodes [4], and nodes with high PageRank scores are considered highly important [11]. The selection of such topological properties for each task is usually subjective, and sometimes even arbitrary, so existing methods may miss critical patterns that characterize node functionalities in the networks. Moreover, when the target functional nodes are complex, it can be very difficult to relate them to existing topological properties, which makes such a strategy ineffective in practice.

Information about a real network usually involves both high-dimensional node characteristics and a network structure consisting of links between the nodes. These two types of information are jointly referred to as linked data. For example, on Facebook, each node denotes a person whose characteristics include preferences, posts, number of friends, etc., and links represent interactions between users. In such data, both node and link information are critical for identifying the role that each node plays in the network. For example, zombie users tend to have more friends (link structure) than normal users and to publish meaningless posts (node characteristics). Existing approaches that use only link structure information fail to capture the relevant information in node characteristics. Moreover, since the link structure describes connectivity between users and does not directly express node functionalities, the selected topological features (e.g., number of friends) may miss critical information hidden in the link structure.
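To make the topology-only heuristics discussed above concrete, the following minimal sketch ranks nodes by PageRank using plain power iteration. It illustrates the kind of hand-picked topological feature that such methods rely on, not the approach proposed in this paper. The function name `pagerank`, the damping value 0.85, and the five-node toy graph are assumptions chosen for illustration.

```python
def pagerank(adj, damping=0.85, iters=100):
    """Power-iteration PageRank on a dict {node: [out-neighbors]}.

    Every neighbor must itself be a key of `adj`.
    """
    nodes = list(adj)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        # Teleportation mass shared equally by all nodes.
        new = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            out = adj[v]
            if out:
                # Split the node's damped rank among its out-neighbors.
                share = damping * rank[v] / len(out)
                for u in out:
                    new[u] += share
            else:
                # Dangling node: spread its damped rank over all nodes.
                share = damping * rank[v] / n
                for u in nodes:
                    new[u] += share
        rank = new
    return rank


# Hypothetical undirected toy graph: hub "a" linked to "b", "c", "d"; "d" linked to "e".
toy = {
    "a": ["b", "c", "d"],
    "b": ["a"],
    "c": ["a"],
    "d": ["a", "e"],
    "e": ["d"],
}
ranks = pagerank(toy)
top_node = max(ranks, key=ranks.get)  # node flagged as "important" by this heuristic
```

A score like this depends only on the links, so two nodes with similar connectivity but very different attributes (for example, a zombie account and an ordinary active user) receive similar scores.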
In conclusion, an effective functional node detection approach has to utilize the information from both the network structure and the node characteristics. In this paper, we propose a novel Feature Integration based Functional Node Detection (FIND) model for detecting nodes of specific functionalities on linked data. Specifically, the proposed FIND model simultaneously maps the information of these two aspects (network structure and node characteristics) to a unified latent feature space that captures their shared characteristics. In addition, several labeled nodes are used to guide the mapping and learning process, so that the extracted latent feature representation of the nodes can effectively capture ...

Copyright © SIAM. Unauthorized reproduction of this article is prohibited.
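The idea of mapping node characteristics and network structure into one latent space, guided by a few labels, can be sketched as a small joint factorization. This is not the FIND objective, which this excerpt does not state. It is a generic stand-in that assumes squared-error reconstructions X ≈ UVᵀ (attributes) and A ≈ UUᵀ (symmetric links) with a shared node embedding U, plus a linear scorer w fitted on the labeled nodes. The names `fit`, `alpha`, `beta`, and the six-node toy data are all hypothetical.

```python
import random

def matmul(P, Q):
    """Product of two matrices stored as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)] for row in P]

def transpose(M):
    return [list(col) for col in zip(*M)]

def residual(P, T):
    """Elementwise difference P - T."""
    return [[p - t for p, t in zip(pr, tr)] for pr, tr in zip(P, T)]

def sq_norm(M):
    return sum(v * v for row in M for v in row)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def loss(X, A, y, U, V, w, alpha, beta):
    """Attribute error + alpha * link error + beta * error on labeled nodes."""
    RX = residual(matmul(U, transpose(V)), X)
    RA = residual(matmul(U, transpose(U)), A)
    label_err = sum((dot(U[i], w) - t) ** 2 for i, t in y.items())
    return sq_norm(RX) + alpha * sq_norm(RA) + beta * label_err

def grads(X, A, y, U, V, w, alpha, beta):
    """Analytic gradients of `loss`; assumes A is symmetric."""
    RX = residual(matmul(U, transpose(V)), X)
    RA = residual(matmul(U, transpose(U)), A)
    GX, GA = matmul(RX, V), matmul(RA, U)
    gU = [[2 * gx + 4 * alpha * ga for gx, ga in zip(rx, ra)] for rx, ra in zip(GX, GA)]
    gV = [[2 * g for g in row] for row in matmul(transpose(RX), U)]
    gw = [0.0] * len(w)
    for i, t in y.items():
        r = dot(U[i], w) - t
        for c in range(len(w)):
            gU[i][c] += 2 * beta * r * w[c]
            gw[c] += 2 * beta * r * U[i][c]
    return gU, gV, gw

def step(M, G, lr):
    return [[m - lr * g for m, g in zip(mr, gr)] for mr, gr in zip(M, G)]

def fit(X, A, y, k=2, alpha=1.0, beta=1.0, steps=300, lr=0.05, seed=0):
    """Gradient descent that accepts a step only if the loss decreases."""
    rng = random.Random(seed)
    U = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(len(X))]
    V = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(len(X[0]))]
    w = [rng.uniform(-0.1, 0.1) for _ in range(k)]
    history = [loss(X, A, y, U, V, w, alpha, beta)]
    for _ in range(steps):
        gU, gV, gw = grads(X, A, y, U, V, w, alpha, beta)
        U2, V2 = step(U, gU, lr), step(V, gV, lr)
        w2 = [c - lr * g for c, g in zip(w, gw)]
        new = loss(X, A, y, U2, V2, w2, alpha, beta)
        if new < history[-1]:
            U, V, w = U2, V2, w2
            history.append(new)
        else:
            lr *= 0.5  # step overshot: shrink the step size and retry
    return U, w, history

# Hypothetical toy data: two triangles of nodes joined by one link (2-3).
X = [[1.0, 0.0, 0.9], [0.9, 0.1, 1.0], [1.0, 0.2, 0.8],
     [0.1, 1.0, 0.0], [0.0, 0.9, 0.1], [0.2, 1.0, 0.0]]
A = [[0, 1, 1, 0, 0, 0], [1, 0, 1, 0, 0, 0], [1, 1, 0, 1, 0, 0],
     [0, 0, 1, 0, 1, 1], [0, 0, 0, 1, 0, 1], [0, 0, 0, 1, 1, 0]]
y = {0: 1.0, 5: 0.0}  # only two labeled nodes guide the embedding

U, w, history = fit(X, A, y)
scores = [dot(u, w) for u in U]  # functional-node score for every node
```

Because U must reconstruct both X and A, attribute patterns and link patterns constrain the same coordinates, while the label term steers those coordinates toward directions that separate the labeled examples. The weights `alpha` and `beta` trade these three pressures off against each other.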

[1] Ronald Rosenfeld, et al. Semi-supervised learning with graphs, 2005.

[2] Hisashi Kashima, et al. Kernels for graph classification, 2002.

[3] Ben Taskar, et al. Introduction to statistical relational learning, 2007.

[4] Paolo Massa, et al. Trustlet, Open Research on Trust Metrics, 2001, BIS.

[5] Aidong Zhang, et al. Bridging Centrality: Identifying Bridging Nodes in Scale-free Networks, 2006.

[6] Trevor Hastie, et al. An Introduction to Statistical Learning, 2013, Springer Texts in Statistics.

[7] Chris H. Q. Ding, et al. Orthogonal nonnegative matrix t-factorizations for clustering, 2006, KDD '06.

[8] Bernhard Schölkopf, et al. Estimating the Support of a High-Dimensional Distribution, 2001, Neural Computation.

[9] Rajeev Motwani, et al. The PageRank Citation Ranking: Bringing Order to the Web, 1999, WWW 1999.

[10] Zhi-Hua Zhou, et al. A New Analysis of Co-Training, 2010, ICML.

[11] Kurt Mehlhorn, et al. Weisfeiler-Lehman Graph Kernels, 2011, J. Mach. Learn. Res.

[12] Madhav V. Marathe, et al. Finding Critical Nodes for Inhibiting Diffusion of Complex Contagions in Social Networks, 2010, ECML/PKDD.

[13] M. Chuah, et al. Spam Detection on Twitter Using Traditional Classifiers, 2011, ATC.

[14] Xiaojin Zhu, et al. Semi-Supervised Learning, 2010, Encyclopedia of Machine Learning.

[15] Huan Liu, et al. Unsupervised feature selection for linked social media data, 2012, KDD.

[16] Jennifer Neville, et al. Relational Dependency Networks, 2007, J. Mach. Learn. Res.

[17] Thomas S. Huang, et al. Supervised translation-invariant sparse coding, 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[18] Liang Ge, et al. Pseudo Cold Start Link Prediction with Multiple Sources in Social Networks, 2012, SDM.

[19] John D. Lafferty, et al. Diffusion Kernels on Graphs and Other Discrete Input Spaces, 2002, ICML.

[20] Chris H. Q. Ding, et al. Simultaneous clustering of multi-type relational data via symmetric nonnegative matrix tri-factorization, 2011, CIKM '11.

[21] Huan Liu, et al. Feature Selection with Linked Data in Social Media, 2012, SDM.