Online Social Network Profile Linkage

Piecing together social signals from people in different online social networks is key for downstream analytics. However, users may have different usernames in different social networks, making the linkage task difficult. To enable this, we explore a probabilistic approach that uses a domain-specific prior knowledge to address this problem of online social network user profile linkage. At scale, linkage approaches that are based on a naive pairwise comparisons that have quadratic complexity become prohibitively expensive. Our proposed threshold-based canopying framework – named OPL – reduces this pairwise comparisons, and guarantees a upper bound theoretic linear complexity with respect to the dataset size. We evaluate our approaches on real-world, large-scale datasets obtained from Twitter and Linkedin. Our probabilistic classifier integrating prior knowledge into Naive Bayes performs at over 85% F 1-measure for pairwise linkage, comparable to state-of-the-art approaches.

[1]  Claude Castelluccia,et al.  How Unique and Traceable Are Usernames? , 2011, PETS.

[2]  Vitaly Shmatikov,et al.  De-anonymizing Social Networks , 2009, 2009 30th IEEE Symposium on Security and Privacy.

[3]  Wentian Li,et al.  Random texts exhibit Zipf's-law-like word frequency distribution , 1992, IEEE Trans. Inf. Theory.

[4]  Virgílio A. F. Almeida,et al.  Studying User Footprints in Different Online Social Networks , 2012, 2012 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining.

[5]  Peter Christen,et al.  A Comparison of Personal Name Matching: Techniques and Practical Issues , 2006, Sixth IEEE International Conference on Data Mining - Workshops (ICDMW'06).

[6]  Pável Calado,et al.  Efficient and Effective Duplicate Detection in Hierarchical Data , 2013, IEEE Transactions on Knowledge and Data Engineering.

[7]  Pável Calado,et al.  Resolving user identities over social networks through supervised learning and rich similarity features , 2012, SAC '12.

[8]  Federica Cena,et al.  User identification for cross-system personalisation , 2009, Inf. Sci..

[9]  Muhammad Abulaish,et al.  An MCL-Based Text Mining Approach for Namesake Disambiguation on the Web , 2012, 2012 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology.

[10]  Bartunov Sergey,et al.  Joint Link-Attribute User Identity Resolution in Online Social Networks , 2012 .

[11]  Erhard Rahm,et al.  Frameworks for entity matching: A comparison , 2010, Data Knowl. Eng..

[12]  Erhard Rahm,et al.  Schema and ontology matching with COMA++ , 2005, SIGMOD '05.

[13]  Vincent Y. Shen,et al.  User identification across multiple social networks , 2009, 2009 First International Conference on Networked Digital Technologies.

[14]  William W. Cohen,et al.  Learning to match and cluster large high-dimensional data sets for data integration , 2002, KDD.

[15]  Peter Christen,et al.  A Survey of Indexing Techniques for Scalable Record Linkage and Deduplication , 2012, IEEE Transactions on Knowledge and Data Engineering.

[16]  Ahmed K. Elmagarmid,et al.  Duplicate Record Detection: A Survey , 2007, IEEE Transactions on Knowledge and Data Engineering.

[17]  Qinghua Zheng,et al.  Combining machine learning and human judgment in author disambiguation , 2011, CIKM '11.

[18]  Reza Zafarani,et al.  Connecting users across social media sites: a behavioral-modeling approach , 2013, KDD.

[19]  Fan Zhang,et al.  What's in a name?: an unsupervised approach to link users across communities , 2013, WSDM.

[20]  Cordelia Schmid,et al.  Local Features and Kernels for Classification of Texture and Object Categories: A Comprehensive Study , 2006, 2006 Conference on Computer Vision and Pattern Recognition Workshop (CVPRW'06).

[21]  Li Qian,et al.  Sample-driven schema mapping , 2012, SIGMOD Conference.