CAG: Stylometric Authorship Attribution of Multi-Author Documents Using a Co-Authorship Graph

Stylometry has been applied successfully to the authorship identification of single-author documents (AISD), the task of identifying the original author of an anonymous document from a group of candidate authors. AISD techniques, however, are not applicable to the authorship identification of multi-author documents (AIMD): in AISD each document is written by a single author, whereas AIMD handles documents written by several. AIMD is also the harder problem. Because of the combinatorial nature of multi-author documents, AIMD lacks ground-truth information on which listed authors actually wrote a given document and which did not. Previous AIMD solutions have three limitations: (i) the best stylometry-based AIMD solution has low accuracy, below 30%; (ii) their performance degrades as the number of co-authors per paper increases; and (iii) they were not designed to handle non-writing authors (NWAs), even though NWAs exist in real-world cases, that is, there are papers for which not every listed co-author contributed as a writer. This paper proposes an AIMD framework called the Co-Authorship Graph (CAG) that can be used to (i) capture the stylistic information of each author in a corpus of multi-author documents and (ii) make a multi-label prediction for a multi-author query document. We conducted extensive experimental studies on one synthetic and three real-world corpora. The experimental results show that the proposed framework (i) significantly outperforms competitive techniques; (ii) handles a larger number of co-authors more effectively than competitive techniques; and (iii) effectively handles NWAs in multi-author documents.
