Measurement of online social networks

In recent years, the popularity of online social networks (OSNs) has risen to unprecedented levels, with the most popular ones having hundreds of millions of users. This success has generated interest within the networking community and has given rise to a number of measurement and characterization studies, which provide a first step towards understanding these networks. The large size and access limitations of most OSNs make it difficult to obtain a full view of users and their relations. Sampling methods are thus essential for practical estimation of OSN properties.

Key to OSN sampling schemes is the fact that users are, by definition, connected to one another via some relation. Samples of OSN users can therefore be obtained by exploring the OSN social graph or graphs induced by other relations between users. While sampling can, in principle, allow precise inference from a relatively small number of observations, this depends critically on the ability to collect a sample with known statistical properties. An early family of measurement studies followed Breadth-First-Search (BFS) type approaches, in which all nodes of a graph reachable from an initial seed were explored exhaustively. In this thesis, we follow a more principled approach: we perform random walks on the social graph to collect uniform samples of OSN users, which are representative and appropriate for further statistical analysis.

First, we provide an instructive comparison of different graph exploration techniques and apply a number of known but perhaps underutilized methods to this problem. We show that previously used BFS-type methods can produce biased samples with poor statistical properties when the full graph is not covered, while random walks perform remarkably well. We also demonstrate how to assess convergence online for random walk-based approaches. Second, we propose multigraph sampling, a novel technique that performs a random walk on a combination of OSN user relations. Performed properly, multigraph sampling can improve mixing time and yield an asymptotic probability sample of a target population even where no single connected relation on the same population is available. Third, we apply the presented methods to collect some of the first known unbiased samples of large-scale OSNs. An important part of this collection is the development of efficient crawlers that address the related technical challenges. Using the collected datasets, we present characterization studies of Facebook and Last.fm. Finally, we present the first study to characterize the statistical properties of OSN applications and propose a method to model the application installation process.
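
As a minimal illustration of how a random walk can yield a sample with known statistical properties, the sketch below runs a Metropolis-Hastings random walk on a small undirected graph. This is one standard way to target the uniform distribution over nodes; the abstract does not state which random-walk variant is used, so the choice is an assumption. The function name `mh_random_walk` and the graph `toy_graph` are hypothetical and serve only as an example.

```python
import random
from collections import Counter


def mh_random_walk(neighbors, seed_node, num_steps, rng):
    """Metropolis-Hastings random walk targeting the uniform distribution.

    `neighbors` maps each node to a list of its neighbors in an
    undirected, connected graph (hypothetical input format).
    """
    current = seed_node
    samples = []
    for _ in range(num_steps):
        # Propose a neighbor uniformly at random, as a simple random walk would.
        candidate = rng.choice(neighbors[current])
        # Accept with probability min(1, deg(current) / deg(candidate)).
        # This cancels the simple random walk's bias towards high-degree nodes.
        if rng.random() < len(neighbors[current]) / len(neighbors[candidate]):
            current = candidate
        # A rejected proposal repeats the current node in the sample.
        samples.append(current)
    return samples


# Toy undirected graph (illustrative only): node "a" has degree 4,
# nodes "d" and "e" have degree 1.
toy_graph = {
    "a": ["b", "c", "d", "e"],
    "b": ["a", "c"],
    "c": ["a", "b"],
    "d": ["a"],
    "e": ["a"],
}

rng = random.Random(0)
samples = mh_random_walk(toy_graph, "a", 100_000, rng)
freq = {node: count / len(samples) for node, count in Counter(samples).items()}
for node in sorted(freq):
    print(node, round(freq[node], 3))
```

On this toy graph the visit frequencies should all approach 1/5 as the walk grows longer, regardless of node degree. A simple random walk without the acceptance step would instead visit each node in proportion to its degree.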