An Empirical Comparison of Label Prediction Algorithms on Automatically Inferred Networks

Predicting the label of a network node from the labels of the remaining nodes is an area of growing interest in machine learning, since many types of data are naturally represented as nodes in a graph. As more methods are proposed for this task, comparing their performance becomes a key problem. In this paper we present an extensive experimental comparison of 15 different methods on 15 different labelled networks, and we release all datasets and source code. We also release a further set of networks that were not used in this study, because not all benchmarked methods could handle very large datasets. Besides the release of data, protocols and algorithms, the key contribution of this study is that in each of the 225 method-network combinations we tested, the best performance, in both accuracy and running time, was achieved by the same algorithm: Online Majority Vote. This is also one of the simplest methods to implement.
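To illustrate why a majority-vote predictor is simple to implement, here is a minimal sketch. The abstract does not specify the algorithm, so the details are assumptions: binary labels in {-1, +1}, an undirected weighted graph, nodes presented one at a time with the true label revealed after each prediction, and ties broken by a fixed default label. The function name `online_majority_vote` and its parameters are hypothetical and are not taken from the released source code.

```python
from collections import defaultdict


def online_majority_vote(edges, stream, default=1):
    """Predict each node's label by a weighted vote of already-revealed neighbours.

    edges   : iterable of (u, v, weight) for an undirected weighted graph
    stream  : iterable of (node, true_label) pairs, labels in {-1, +1},
              in the order the nodes are presented (assumed protocol)
    default : label predicted when the vote is tied or no neighbour is revealed
    Returns the list of predictions, one per item in the stream.
    """
    # Build a symmetric adjacency map: node -> {neighbour: weight}.
    adjacency = defaultdict(dict)
    for u, v, w in edges:
        adjacency[u][v] = w
        adjacency[v][u] = w

    revealed = {}      # labels seen so far
    predictions = []
    for node, true_label in stream:
        # Weighted sum of the labels of neighbours whose label is already known.
        score = sum(
            w * revealed[nbr]
            for nbr, w in adjacency[node].items()
            if nbr in revealed
        )
        if score > 0:
            pred = 1
        elif score < 0:
            pred = -1
        else:
            pred = default
        predictions.append(pred)
        # The true label is revealed only after the prediction is made.
        revealed[node] = true_label
    return predictions
```

Each prediction only touches the edges incident to the queried node, which is consistent with the low running time reported above, although the actual implementation benchmarked in the paper may differ.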
