The Language of Weblogs: A study of genre and individual differences

This thesis describes a linguistic investigation of individual differences in online personal diaries, or ‘blogs.’ There is substantial evidence of gender differences in language (Lakoff, 1975), and, to a lesser extent, of the linguistic projection of personality (Pennebaker & King, 1999). Recent work has investigated these latter differences in the area of computer-mediated communication (CMC), specifically e-mail (Gill, 2004). This thesis employs a number of analytic techniques, both top-down (dictionary-based) and bottom-up (data-driven), in order to explore personality and gender differences in the language of blogs. A corpus was constructed by asking authors to submit a month of text and complete a sociobiographic questionnaire. The corpus consists of over 400,000 words and five-factor personality data (Buchanan, 2001) for 71 subjects. The thesis begins by framing blogs in the context of other genres, both CMC and traditional, in order to show both the distinctiveness and representativeness of the genre. Top-down content analysis techniques are then employed to investigate the relationship between personality and linguistic features. A number of features correlate with each trait, but upon regression, very little variance is explained. Bottom-up techniques are more successful. The corpus was stratified into high, low and neutral personality groups to identify distinctive collocations for each. Returning to the raw personality scores, it becomes clear that even a small amount of n-gram context helps account for much more variance in personality. A measure of contextuality (Heylighen & Dewaele, 2002) shows that authors considered high in Agreeableness pay more attention to differences between their extra-linguistic context and that of their audience. Attention then turns to gender, where similar methods are applied. Many previous findings are confirmed in the blog corpus.
In addition, women are found to write more in their blogs than men. More generally, using the British National Corpus, it is shown that women are more contextual, except in the least contextual of genres (academic writing) where there is no difference. The study concludes by confirming that both gender and personality are projected by language in blogs; furthermore, approaches which take the context of language features into account can be used to detect more variation than those which do not.

[1]  K. Popper The Poverty of Historicism , 1959 .

[2]  H. Eysenck Biological Basis of Personality , 1963, Nature.

[3]  V. B. Cervin,et al.  Persuasiveness and Persuasibility as Related to Intelligence and Extraversion. , 1965, The British journal of social and clinical psychology.

[4]  R.W. Ramsay,et al.  Speech Patterns and Personality , 1968, Language and speech.

[5]  A. Campbell,et al.  Bodily communication and personality. , 1978, The British journal of social and clinical psychology.

[6]  A. W. Siegman The Meaning of Silent Pauses in the Initial Interview , 1978, The Journal of nervous and mental disease.

[7]  P. Nichols Black Women in the Rural South: Conservative and Innovative , 1978 .

[8]  Carolyn R. Miller Genre as social action , 1984 .

[9]  H. Eysenck,et al.  A revised version of the Psychoticism scale. , 1985 .

[10]  A. Thorne The press of personality: A study of conversations between introverts and extraverts. , 1987 .

[11]  P. Costa,et al.  Validation of the five-factor model of personality across instruments and observers. , 1987, Journal of personality and social psychology.

[12]  Michael Wilson,et al.  MRC psycholinguistic database: Machine-usable dictionary, version 2.00 , 1988 .

[13]  Douglas Biber,et al.  Variation across speech and writing: Methodology , 1988 .

[14]  W. Labov The intersection of sex and social class in the course of linguistic change , 1990, Language Variation and Change.

[15]  D. Tannen You just don't understand: women and men in conversation. Morrow , 1990 .

[16]  J. M. Digman Personality Structure: Emergence of the Five-Factor Model , 1990 .

[17]  A. Furnham Language and personality. , 1990 .

[18]  P. Costa,et al.  Facet Scales for Agreeableness and Conscientiousness: A Revision of the NEO Personality Inventory , 1991 .

[19]  S. Murray You just don't understand: Women and men in conversation , 1992 .

[20]  J. S. Wiggins,et al.  Personality: structure and assessment. , 1992, Annual review of psychology.

[21]  W. Orlikowski,et al.  Genres of Organizational Communication: A Structurational Approach to Studying Communication and Media , 1992 .

[22]  Ted Dunning,et al.  Accurate Methods for the Statistics of Surprise and Coincidence , 1993, CL.

[23]  Hans J. Eysenck,et al.  From DNA to Social Behaviour: Conditions for a Paradigm of Personality Research , 1993 .

[24]  Catherine N. Ball Automated Text Analysis: Cautionary Tales , 1993 .

[25]  R. Lynn,et al.  Sex differences in competitiveness and the valuation of money in twenty countries. , 1993, The Journal of social psychology.

[26]  P. Kline Handbook of Psychological Testing , 2013 .

[27]  L. R. Goldberg The structure of phenotypic personality traits. , 1993, The American psychologist.

[28]  Fred J. Damerau,et al.  Generating and Evaluating Domain-Oriented Multi-Word Terms from Texts , 1993, Inf. Process. Manag..

[29]  Jussi Karlgren,et al.  Recognizing Text Genres With Simple Metrics Using Discriminant Analysis , 1994, COLING.

[30]  Donald W. Hine,et al.  The Role of Verbal Behavior in the Encoding and Decoding of Interpersonal Dispositions , 1994 .

[31]  Christopher D. B. Burt,et al.  Prospective and retrospective account-making in diary entries: A model of anxiety reduction and avoidance , 1994 .

[32]  R. R. Abidin Parenting Stress Index: Professional Manual . Odessa, FL: Psychological Assessment Resources , 1995 .

[33]  Richard E. Yellen,et al.  Extraversion and introversion in electronically-supported meetings , 1995, Inf. Manag..

[34]  A. U. Chamot,et al.  The Good Language Learner , 1996 .

[35]  Simeon Yates,et al.  Oral and written linguistic aspects of computer conferencing : A corpus based study , 1996 .

[36]  Ted Pedersen,et al.  Significant Lexical Relationships , 1996, AAAI/IAAI, Vol. 1.

[37]  Susan C. Cloninger Personality: Description, Dynamics, and Development , 1996 .

[38]  Adwait Ratnaparkhi,et al.  A Maximum Entropy Model for Part-Of-Speech Tagging , 1996, EMNLP.

[39]  W. Mischel Personality and Assessment , 1996 .

[40]  M. Collot,et al.  Electric language : A new variety of English , 1996 .

[41]  Jean-Marc Dewaele,et al.  How to measure formality of speech? A model of synchronic variation , 1996 .

[42]  S. Herring Computer-mediated communication : linguistic, social and cross-cultural perspectives , 1996 .

[43]  C. Werry Linguistic and interactional features of Internet relay chat , 1996 .

[44]  Ted Pedersen,et al.  Fishing for Exactness , 1996, ArXiv.

[45]  Adam Kilgarriff,et al.  Using Word Frequency Lists to Measure Corpus Homogeneity and Similarity between Corpora , 1997, VLC.

[46]  Kevin Crowston,et al.  Reproduced and emergent genres of communication on the World-Wide Web , 1997, Proceedings of the Thirtieth Hawaii International Conference on System Sciences.

[47]  P. Costa,et al.  Personality trait structure as a human universal. , 1997, The American psychologist.

[48]  P. Eckert Gender and sociolinguistic variation , 1997 .

[49]  Hinrich Schütze,et al.  Automatic Detection of Text Genre , 1997, ACL.

[50]  G. Leech,et al.  Social differentiation in the use of English vocabulary: some analyses of the conversational component of the British National Corpus , 1997 .

[51]  Olle Bälter,et al.  Electronic mail in a working context , 1998 .

[52]  Jean-Marc Dewaele,et al.  Speech rate variation in 2 oral styles of advanced French interlanguage , 1998 .

[53]  I. Deary,et al.  Personality Traits, 2nd Edition , 1998 .

[54]  A. Furnham,et al.  Extraversion: The Unloved Variable in Applied Linguistic Research , 1999 .

[55]  J. Pennebaker,et al.  Linguistic styles: language use as an individual difference. , 1999, Journal of personality and social psychology.

[56]  Michael Mateas,et al.  An Oz-Centric Review of Interactive Drama and Believable Agents , 1999, Artificial Intelligence Today.

[57]  Michael A. Shepherd,et al.  The functionality attribute of cybergenres , 1999, Proceedings of the 32nd Annual Hawaii International Conference on Systems Sciences (HICSS-32).

[58]  P. Costa,et al.  Nature over nurture: temperament, personality, and life span development. , 2000, Journal of personality and social psychology.

[59]  Clifford Nass,et al.  Does computer-generated speech manifest personality? an experimental test of similarity-attraction , 2000, CHI.

[60]  A. Furnham,et al.  Personality and speech production: a pilot study of second language learners , 2000 .

[61]  Efstathios Stamatatos,et al.  Automatic Text Categorization In Terms Of Genre and Author , 2000, CL.

[62]  Efstathios Stamatatos,et al.  Text Genre Detection Using Common Word Frequencies , 2000, COLING.

[63]  David Crystal,et al.  Language and the Internet , 2001 .

[64]  Michael Wilson MRC Psycholinguistic Database , 2001 .

[65]  David Y. W. Lee,et al.  Genres, Registers, Text Types, Domains and Styles: Clarifying the Concepts and Navigating a Path through the BNC Jungle , 2001 .

[66]  James J. Bradac,et al.  Empirical Support for the Gender-as-Culture Hypothesis: An Intercultural Analysis of Male/Female Language Differences. , 2001 .

[67]  R. Thomson,et al.  Predicting gender from electronic discourse. , 2001, The British journal of social psychology.

[68]  J. Swales,et al.  Genre identification and communicative purpose: A problem and a possible solution , 2001 .

[69]  Naomi S. Baron Commas and canaries: the role of punctuation in speech and writing , 2001 .

[70]  Hinrich Schütze,et al.  Book Reviews: Foundations of Statistical Natural Language Processing , 1999, CL.

[71]  C. Y. Peng,et al.  An Introduction to Logistic Regression Analysis and Reporting , 2002 .

[72]  Torill Mortensen,et al.  Blogging thoughts: personal publication as an online research tool , 2002 .

[73]  Ann Colley,et al.  Gender-Linked Differences in the Style and Content of E-Mails to Friends , 2002 .

[74]  Niki Panteli,et al.  Richness, power cues and email text , 2002, Inf. Manag..

[75]  Ylva Hård af Segerstad Use and Adaptation of Written Language to the Conditions of Computer-Mediated Communication , 2002 .

[76]  P. Markey,et al.  Interpersonal Perception in Internet Chat Rooms , 2002 .

[77]  Frank Keller,et al.  Using the Web to Overcome Data Sparseness , 2002, EMNLP.

[78]  Marco Perugini,et al.  Big Five Assessment , 2002 .

[79]  Jean-Marc Dewaele,et al.  Variation in the Contextuality of Language: An Empirical Measure , 2002 .

[80]  Shlomo Argamon,et al.  Automatically Categorizing Written Texts by Author Gender , 2002, Lit. Linguistic Comput..

[81]  James W. Pennebaker,et al.  Language Use and Personality during Crises: Analyses of Mayor Rudolph Giuliani's Press Conferences , 2002 .

[82]  Jean‐Marc Dewaele Individual differences in L2 fluency: the effect of neurobiological correlates , 2002 .

[83]  Alastair J. Gill,et al.  Perception of e-mail personality at zero-acquaintance: Extraversion takes care of itself; Neuroticism is a worry , 2003 .

[84]  Satanjeev Banerjee,et al.  The Design, Implementation, and Use of the Ngram Statistics Package , 2003, CICLing.

[85]  Matthew Hurst,et al.  BlogPulse: Automated Trend Discovery for Weblogs , 2003 .

[86]  Anat Rachel Shimoni,et al.  Gender, genre, and writing style in formal written texts , 2003 .

[87]  J. Pennebaker,et al.  Psychological aspects of natural language use: our words, our selves. , 2003, Annual review of psychology.

[88]  Paul Edward Rayson,et al.  Matrix : a statistical method and software tool for linguistic analysis through corpus comparison , 2003 .

[89]  Alastair J. Gill,et al.  Language With Character: A Stratified Corpus Comparison of Individual Differences in E-Mail Communication , 2006 .

[90]  Lois Ann Scheidt,et al.  Bridging the gap: a genre analysis of Weblogs , 2004, Proceedings of the 37th Annual Hawaii International Conference on System Sciences.

[91]  Alastair J. Gill,et al.  Individual differences and implicit language: personality, parts-of-speech and pervasiveness , 2004 .

[92]  J. Pennebaker,et al.  Linguistic Markers of Psychological Change Surrounding September 11, 2001 , 2004, Psychological science.

[93]  Bonnie A. Nardi,et al.  Blogging as social activity, or, would you let 900 million people read your diary? , 2004, CSCW.

[94]  Torill Elvira Mortensen Personal Publication and Public Attention , 2004 .

[95]  Cameron A. Marlow Audience, structure and authority in the weblog community , 2004 .

[96]  Alastair J. Gill,et al.  Interpersonality: Individual differences and interpersonal priming , 2004 .

[97]  Alastair J. Gill Personality and language: the projection and perception of personality in computer-mediated communication , 2004 .

[98]  Arthur C. Graesser,et al.  Variation in Language and Cohesion across Written and Spoken Registers , 2004 .

[99]  Kristin Helen Andersen Student’s Use of Weblogs. Weblogs for Collaboration in an Educational Setting , 2004 .

[100]  S. Herring,et al.  Women and Children Last: The Discursive Construction of Weblogs , 2004 .

[101]  S. Gosling,et al.  Should we trust web-based studies? A comparative analysis of six preconceptions about internet questionnaires. , 2004, The American psychologist.

[102]  G. A. Mishne,et al.  Experiments with mood classification in blog posts , 2005, SIGIR 2005.

[103]  Dan Li Why Do You Blog: A Uses-and-Gratifications Inquiry into Bloggers' Motivations , 2005 .

[104]  Alistair Kennedy,et al.  Automatic Identification of Home Pages on the Web , 2005, Proceedings of the 38th Annual Hawaii International Conference on System Sciences.

[105]  Jean‐Marc Dewaele,et al.  Investigating the Psychological and Emotional Dimensions in Instructed Language Learning: Obstacles and Possibilities , 2005 .

[106]  Lois Ann Scheidt,et al.  Weblogs as a bridging genre , 2005, Inf. Technol. People.

[107]  David A. Huffaker,et al.  Gender, Identity, and Language Use in Teenage Blogs , 2006, J. Comput. Mediat. Commun..

[108]  James W. Pennebaker,et al.  The Language of Love: Sex, Sexual Orientation, and Language Use in Online Personal Advertisements , 2005 .

[109]  Aldo de Moor,et al.  Beyond Personal Webpublishing: An Exploratory Study of Conversational Blogging Practices , 2005, Proceedings of the 38th Annual Hawaii International Conference on System Sciences.

[110]  John A. Johnson,et al.  Implementing a five-factor personality inventory for use on the internet , 2005 .

[111]  Jon Oberlander,et al.  Weblogs, genres and individual differences , 2005 .

[112]  Steven Skiena,et al.  Newspapers vs. Blogs: Who Gets the Scoop? , 2006, AAAI Spring Symposium: Computational Approaches to Analyzing Weblogs.

[113]  Tim Weninger,et al.  Collaborative and Structural Recommendation of Friends using Weblog-based Social Network Analysis , 2006, AAAI Spring Symposium: Computational Approaches to Analyzing Weblogs.

[114]  John D. Burger,et al.  An Exploration of Observable Features Related to Blogger Age , 2006, AAAI Spring Symposium: Computational Approaches to Analyzing Weblogs.

[115]  Maarten de Rijke,et al.  Learning to Recognize Blogs: A Preliminary Exploration , 2006 .

[116]  Richard Tong,et al.  Weblogs as Market Indicators: Tracking Reactions to Issues and Events , 2006, AAAI Spring Symposium: Computational Approaches to Analyzing Weblogs.

[117]  Marina Santini,et al.  Interpreting Genre Evolution on the Web , 2006 .

[118]  Jussi Karlgren Proceedings of the workshop on New Text: Wikis and blogs and other dynamic text sources , 2006 .

[119]  Hugo Liu,et al.  A Corpus-based Approach to Finding Happiness , 2006, AAAI Spring Symposium: Computational Approaches to Analyzing Weblogs.

[120]  Shlomo Argamon,et al.  Effects of Age and Gender on Blogging , 2006, AAAI Spring Symposium: Computational Approaches to Analyzing Weblogs.

[121]  Belle L. Tseng,et al.  Important Weblog Identification and Hot Story Summarization , 2006, AAAI Spring Symposium: Computational Approaches to Analyzing Weblogs.

[122]  Aidan Finn,et al.  Learning to classify documents according to genre , 2006, J. Assoc. Inf. Sci. Technol..