Paper information - UniNE at PAN-CLEF 2019: Bots and Gender Task

In participating in the “bots and gender” subtask (in both English and Spanish), our aim is to automatically detect the source of a text (a sequence of tweets sent by a bot or by a human). When a text is identified as having been sent by a human, the system must also determine the author’s gender (author profiling). To solve these questions, we focus on a simple classifier (k-NN, k = 5) that is usually able to produce a correct answer, but not in an efficient way. Thus, we apply a feature selection procedure to reduce the number of terms (to around 200 to 500). We also propose applying a Zeta model to reduce the number of decisions taken by the k-NN classifier. In this case, we focus on terms used in one category and ignored or rarely used by the other. In addition, the Type-Token Ratio (TTR) or the lexical density (LD) presents some merit in discriminating between tweets sent by a bot (TTR < 0.2, LD ≥ 0.8) and those sent by humans (TTR ≥ 0.2, LD < 0.8).
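The TTR and LD thresholds quoted in the abstract can be illustrated with a minimal Python sketch. Several details here are assumptions and are not taken from the paper: the regex tokenizer, the tiny `FUNCTION_WORDS` list (a placeholder for a real function-word list), the definition of LD as the share of tokens that are not function words, and the choice to require both conditions at once in the hypothetical helper `looks_like_bot`. The text passed in is assumed to be the concatenated tweets of one account.

```python
import re

# Hypothetical, deliberately tiny function-word list (an assumption;
# the list used in the paper is not given in the abstract).
FUNCTION_WORDS = {
    "the", "a", "an", "of", "to", "in", "and", "or",
    "is", "it", "i", "you", "that", "on", "for",
}


def tokenize(text):
    """Assumed tokenizer: lowercase runs of letters, digits, apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def type_token_ratio(tokens):
    """Number of distinct word types divided by the number of tokens."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def lexical_density(tokens):
    """Share of tokens that are not function words (assumed definition)."""
    if not tokens:
        return 0.0
    content = [t for t in tokens if t not in FUNCTION_WORDS]
    return len(content) / len(tokens)


def looks_like_bot(text, ttr_max=0.2, ld_min=0.8):
    """Apply the abstract's thresholds; combining them with 'and' is an assumption."""
    tokens = tokenize(text)
    return type_token_ratio(tokens) < ttr_max and lexical_density(tokens) >= ld_min
```

A highly repetitive stream such as `"buy cheap followers now " * 20` has few distinct types and no function words from the list, so it falls on the bot side of both thresholds, while a short varied sentence does not. Real use would need a proper function-word list per language and a tokenizer suited to tweets.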
