Paper information - $CAG$: Stylometric Authorship Attribution of Multi-Author Documents Using a Co-Authorship Graph

Stylometry has been successfully applied to authorship identification of single-author documents (AISD), the task of identifying the original author of an anonymous document from a group of candidate authors. AISD techniques, however, are not applicable to authorship identification of multi-author documents (AIMD). Unlike AISD, where each document is written by a single author, AIMD deals with documents that list several authors. Because of the combinatorial nature of multi-author documents, AIMD lacks ground-truth information (that is, information on which listed authors wrote and which did not), which makes the problem more challenging to solve. Previous AIMD solutions have several limitations: (i) the best stylometry-based AIMD solution achieves an accuracy below 30%; (ii) increasing the number of co-authors per paper degrades the performance of AIMD solutions; and (iii) AIMD solutions were not designed to handle non-writing authors (NWAs). NWAs nevertheless exist in real-world cases: in some papers, not every listed co-author contributed as a writer. This paper proposes an AIMD framework called the Co-Authorship Graph ($CAG$) that can be used to (i) capture the stylistic information of each author in a corpus of multi-author documents and (ii) make a multi-label prediction for a multi-author query document. We conducted extensive experimental studies on one synthetic and three real-world corpora. The results show that the proposed framework (i) significantly outperformed competing techniques; (ii) handles a larger number of co-authors more effectively than competing techniques; and (iii) effectively handles NWAs in multi-author documents.
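The abstract does not say how the framework builds its graph or scores candidates, so the following is only a generic sketch of the two ideas it names, not the paper's method. It assumes a plain co-authorship graph (authors as nodes, edge weights counting shared documents) and a simple threshold rule for the multi-label decision. The names `build_coauthorship_graph`, `predict_authors`, the toy corpus, and the candidate scores are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_coauthorship_graph(corpus):
    """Map each author to {co-author: number of shared documents}.

    `corpus` maps a document id to its list of listed authors.
    Authors who never share a document do not appear in the result.
    """
    graph = defaultdict(lambda: defaultdict(int))
    for authors in corpus.values():
        # Every unordered pair of distinct listed authors shares this document.
        for a, b in combinations(sorted(set(authors)), 2):
            graph[a][b] += 1
            graph[b][a] += 1
    return {a: dict(neighbours) for a, neighbours in graph.items()}


def predict_authors(scores, threshold=0.5):
    """Multi-label decision: keep every candidate whose score clears the threshold.

    Unlike single-author attribution, the result is a set of authors,
    which may contain zero, one, or many names.
    """
    return {author for author, score in scores.items() if score >= threshold}


# Hypothetical toy corpus: document id -> listed authors.
corpus = {
    "doc1": ["alice", "bob"],
    "doc2": ["alice", "bob", "carol"],
    "doc3": ["carol", "dave"],
}
graph = build_coauthorship_graph(corpus)

# Hypothetical per-author stylistic scores for one query document.
scores = {"alice": 0.9, "bob": 0.7, "carol": 0.2, "dave": 0.1}
predicted = predict_authors(scores)
```

In this sketch a listed author whose score falls below the threshold would simply be left out of the predicted set, which is one way to picture a non-writing author; how the actual framework treats NWAs is not stated in the abstract.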
