Improving the Reproducibility of PAN's Shared Tasks: Plagiarism Detection, Author Identification, and Author Profiling

This paper reports on the PAN 2014 evaluation lab, which hosts three shared tasks on plagiarism detection, author identification, and author profiling. To improve the reproducibility of shared tasks in general, and of PAN's tasks in particular, the Webis group developed a new web service called TIRA, which facilitates software submissions. Unlike many other labs, PAN asks participants to submit running software instead of its run output. Handling software submissions involves considerable organizational overhead; the TIRA experimentation platform significantly reduces this workload for both participants and organizers while keeping the submitted software in a running state. This year, we addressed the question of who is responsible for the successful execution of submitted software, putting participants back in charge of executing their software at our site. In total, 57 pieces of software were submitted to our lab; together with last year's 58 software submissions, they form the largest collection of software for our three tasks to date, all of which is readily available for further analysis. The report concludes with a brief summary of each task.
