Content-based similarity measures of weblog authors

Given recent research interest in the confounding roles of homophily and contagion in studies of social influence, there is a strong need for reliable content-based measures of the similarity between people. In this paper, we investigate the use of text similarity measures as a way of predicting the similarity of prolific weblog authors. We describe a novel method of collecting human judgments of the overall similarity between two authors, as well as of their similarity along specific facets: demographics, politics, culture, religion, values, hobbies and interests, personality, and writing style. We then apply a range of automated text similarity measures based on word frequency counts and calculate their statistical correlation with the human judgments. Our findings indicate that commonly used text similarity measures do not correlate well with human judgments of author similarity. However, various measures that pay special attention to personal pronouns and their context correlate significantly with different facets of similarity.
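To make the general procedure concrete, the following is a minimal sketch in Python of one possible pipeline: a cosine similarity over word frequency counts for each pair of authors, followed by a Pearson correlation between those scores and human similarity ratings. It is an illustration under stated assumptions, not the implementation used in this work. The tokenizer, the choice of cosine similarity and Pearson correlation, the pronoun list, and all function names are assumptions, and the author texts and ratings are made-up placeholders rather than data or results from the study.

```python
import math
import re
from collections import Counter

# Assumed pronoun list for the pronoun-restricted variant (illustrative only).
PRONOUNS = {"i", "me", "my", "we", "us", "our", "you", "your",
            "he", "him", "his", "she", "her", "they", "them", "their"}


def word_counts(text, vocabulary=None):
    """Count lowercase word tokens, optionally keeping only words in `vocabulary`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if vocabulary is not None:
        tokens = [t for t in tokens if t in vocabulary]
    return Counter(tokens)


def cosine_similarity(a, b):
    """Cosine similarity between two word-count vectors stored as Counters."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector has no defined direction; treat as dissimilar
    return dot / (norm_a * norm_b)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # correlation is undefined for a constant sequence
    return cov / (sd_x * sd_y)


if __name__ == "__main__":
    # Placeholder texts and ratings, invented purely to exercise the functions.
    authors = {
        "a": "I love my garden and I write about my roses every week",
        "b": "My garden is my refuge and I post photos of my roses",
        "c": "The committee vote was delayed again by procedural motions",
    }
    pairs = [("a", "b"), ("a", "c"), ("b", "c")]
    human_ratings = [4.5, 1.0, 1.5]  # hypothetical 1-5 similarity judgments

    all_words = [cosine_similarity(word_counts(authors[x]), word_counts(authors[y]))
                 for x, y in pairs]
    pronouns_only = [cosine_similarity(word_counts(authors[x], PRONOUNS),
                                       word_counts(authors[y], PRONOUNS))
                     for x, y in pairs]

    print("all words     r =", pearson(all_words, human_ratings))
    print("pronouns only r =", pearson(pronouns_only, human_ratings))
```

Restricting the vocabulary before counting is only one simple way to give pronouns special weight. A measure that also uses pronoun context would need additional features, such as the words adjacent to each pronoun, which this sketch does not attempt.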
