A discriminative random sampling strategy with individual-author feature selection for writeprint recognition of Chinese texts

Abstract Automatic authorship recognition has emerged as a novel technique for investigating cybercrimes. A key challenge is that even a moderate-sized corpus yields a huge number of features, which leads to over-training. Moreover, it is hard to distinguish between candidate authors using only a single feature set. In this paper, we propose a random-sampling ensemble method with individual-author feature selection to exploit the high-dimensional feature space. The proposed method randomly samples writing-style features from each individual-author feature set (IAFS), which is partitioned from the whole feature set. The IAFSs are selected heuristically using each author's training set. Multiple base classifiers (BCs) are then trained on the sampled feature sets, and all BCs are fused to reach a final decision. Experimental results on real-life Chinese forum data verify the robustness of the proposed method compared with conventional ensemble methods. We also analyze the diversity of the algorithm, showing that the ensemble strategy is more effective and constructs more diverse BCs than random subspace methods.
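
The pipeline outlined in the abstract (per-author feature selection, random sampling within each IAFS, base classifiers, fusion) can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the selection heuristic, the base classifier, or the fusion rule, so the mean-gap score in `select_iafs`, the nearest-centroid base classifier, the majority vote, and the parameters `k`, `m`, and `n_samples` are all assumptions chosen for brevity.

```python
import random
from collections import Counter
from statistics import mean


def select_iafs(X, y, author, k):
    """Assumed heuristic: keep the k features whose mean for `author`
    differs most from the mean over all other authors."""
    own = [x for x, a in zip(X, y) if a == author]
    rest = [x for x, a in zip(X, y) if a != author]

    def gap(j):
        return abs(mean(r[j] for r in own) - mean(r[j] for r in rest))

    return sorted(range(len(X[0])), key=gap, reverse=True)[:k]


def train_centroids(X, y, feats):
    """Assumed base classifier: one centroid per author on `feats`."""
    centroids = {}
    for author in set(y):
        rows = [x for x, a in zip(X, y) if a == author]
        centroids[author] = [mean(r[j] for r in rows) for j in feats]
    return centroids


def predict_one(centroids, feats, x):
    """Return the author whose centroid is closest to x on `feats`."""
    def dist(c):
        return sum((x[j] - cj) ** 2 for j, cj in zip(feats, c))

    return min(centroids, key=lambda a: dist(centroids[a]))


def fit_ensemble(X, y, k=4, m=2, n_samples=5, seed=0):
    """For each author, draw n_samples random subsets of size m from
    that author's IAFS and train one base classifier per subset."""
    rng = random.Random(seed)
    ensemble = []
    for author in sorted(set(y)):
        iafs = select_iafs(X, y, author, k)
        for _ in range(n_samples):
            feats = rng.sample(iafs, m)
            ensemble.append((feats, train_centroids(X, y, feats)))
    return ensemble


def predict(ensemble, x):
    """Assumed fusion rule: majority vote over all base classifiers."""
    votes = Counter(predict_one(c, f, x) for f, c in ensemble)
    return votes.most_common(1)[0][0]
```

Here `X` is a list of equal-length numeric feature vectors and `y` the matching author labels. Calling `fit_ensemble(X, y)` followed by `predict(ensemble, x)` labels a new text vector `x`. Sampling inside each author's IAFS, rather than from the whole feature space as in the random subspace method, is what gives each base classifier an author-specific view of the features.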
