Network Sampling Based on Centrality Measures for Relational Classification

Many real-world networks, such as the Internet, social networks, and biological networks, are massive in size, which hinders their processing and analysis. To cope with this, the network size can be reduced without losing relevant information. In this paper, we extend previous work that proposed a sampling method based on the following centrality measures: degree, k-core, clustering, eccentricity, and structural holes. In our experiments, we remove \(30\%\) and \(50\%\) of the vertices, together with their edges, from the original network. We then evaluate our proposal on six real-world networks in a relational classification task using eight different classifiers. Classification results achieved on the sampled graphs generated by our proposal are similar to those obtained on the entire graphs. The execution time of the classifier's learning step is shorter on the sampled graph than on the entire graph or on a randomly sampled graph. In most cases, the original graph was reduced by up to \(50\%\) of its initial number of edges without losing topological properties.
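
To make the sampling step concrete, the following is a minimal sketch of centrality-based vertex removal using degree only. It is not the authors' implementation: the abstract does not say which vertices are removed, so the sketch assumes the lowest-degree vertices are dropped first, with ties broken by vertex label. The function name `degree_sample` and the toy edge list are hypothetical.

```python
from collections import defaultdict


def degree_sample(edges, remove_fraction):
    """Drop a fraction of vertices, lowest degree first (an assumption),
    and return the kept vertices and the edges among them."""
    # Build an undirected adjacency structure from the edge list.
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)

    # Rank vertices by degree (ascending); the label breaks ties deterministically.
    ranked = sorted(adjacency, key=lambda n: (len(adjacency[n]), n))

    # Number of vertices to remove, e.g. 30% or 50% of all vertices.
    n_remove = int(len(ranked) * remove_fraction)
    removed = set(ranked[:n_remove])

    # Removing a vertex also removes every edge incident to it.
    kept_vertices = set(adjacency) - removed
    kept_edges = [(u, v) for u, v in edges
                  if u not in removed and v not in removed]
    return kept_vertices, kept_edges


# Toy graph: a hub (0) with six leaves, attached to a triangle (7, 8, 9).
edges = [(0, 1), (0, 2), (0, 3), (0, 4), (0, 5), (0, 6),
         (0, 7), (7, 8), (8, 9), (7, 9)]

for fraction in (0.3, 0.5):
    vertices, sampled = degree_sample(edges, fraction)
    print(fraction, len(vertices), len(sampled))
```

The other measures named in the abstract (k-core, clustering, eccentricity, structural holes) would replace the degree-based ranking key; the removal step itself stays the same.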
