Provenance-Assisted Classification in Social Networks

Signal feature extraction and classification are two common tasks in the signal processing literature. This paper investigates the use of source identities as a common mechanism for enhancing the classification accuracy of social signals. We define social signals as outputs, such as microblog entries, geotags, or uploaded images, contributed by users in a social network. Many classification tasks can be defined on such outputs. For example, one may want to identify the dialect of a microblog entry contributed by an author, or classify information referred to in a user's tweet as true or false. While the design of such classifiers is application-specific, social signals share one key property: each is tagged with the explicit identity of its source. This raises the question of whether knowing the source of each signal, in addition to exploiting signal features, improves classification accuracy. We call this approach provenance-assisted classification. This paper answers the question affirmatively: it demonstrates how source identities can improve classification accuracy and derives confidence bounds to quantify the accuracy of the results. Evaluation is performed in two real-world contexts: (i) fact-finding, which classifies microblog entries as true or false, and (ii) language classification of tweets issued by a set of possibly multilingual speakers. We also carry out extensive simulation experiments to evaluate the performance of the proposed classification scheme over different problem dimensions. The results show that provenance features significantly improve the classification accuracy of social signals, even when nothing is known about the sources besides their IDs. This observation offers a general mechanism for enhancing classification results in social networks.
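
To make the idea concrete, the following is a minimal sketch of how a source ID alone might sharpen a content-only classifier. It is not the paper's algorithm: it assumes a simple binary mixture model in which each source s has an unknown propensity theta[s] to emit class-1 signals, and it fits those propensities with a few EM iterations. The function name, the smoothing parameter, and the toy scores are all hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def provenance_assisted(scores, sources, iters=50, smooth=1.0):
    """Refine content-only class-1 probabilities using source IDs.

    scores[i]  : P(class 1) from a content-only classifier (assumed given).
    sources[i] : ID of the source that contributed signal i.
    Illustrative model (an assumption, not the paper's): each source s
    emits class 1 with unknown probability theta[s], estimated by EM.
    """
    theta = {s: 0.5 for s in set(sources)}  # no prior knowledge of sources
    post = list(scores)
    for _ in range(iters):
        # E-step: combine content evidence with the source propensity.
        post = []
        for p, s in zip(scores, sources):
            a = p * theta[s]
            b = (1.0 - p) * (1.0 - theta[s])
            post.append(a / (a + b))
        # M-step: re-estimate each propensity as a smoothed mean posterior.
        total = defaultdict(float)
        count = defaultdict(int)
        for q, s in zip(post, sources):
            total[s] += q
            count[s] += 1
        theta = {s: (total[s] + 0.5 * smooth) / (count[s] + smooth)
                 for s in count}
    return post, theta


# Toy data: u1 mostly emits class 1, u2 mostly class 0; one signal per
# source is scored on the wrong side of 0.5 by the content-only classifier.
sources = ["u1"] * 4 + ["u2"] * 4
scores = [0.9, 0.8, 0.7, 0.4, 0.1, 0.2, 0.3, 0.6]
post, theta = provenance_assisted(scores, sources)
labels = [q > 0.5 for q in post]
```

In this toy run the two borderline signals (content scores 0.4 and 0.6) are pulled toward the class their source usually emits, which is the intuition behind provenance-assisted classification. The sketch uses nothing about the sources except their IDs.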
