Authorship Attribution of Noisy Text Data With a Comparative Study of Clustering Methods

With the rapid growth in the volume of data available via the internet, visual analytics (VA) has emerged as a way of visualizing multidimensional data from different perspectives, revealing interesting information about the data and making it clearer and more intelligible. In this investigation, the authors focus on the VA-based Authorship Attribution (AA) task, applied to noisy text data. Furthermore, this article proposes a 3D Visual Analytics technique based on a sphere implementation. The dataset contains several text documents written by 5 American philosophers, with an average length of 850 words per text, which were scanned and then corrupted with different noise levels. The results show that the hierarchical clustering technique using a fully automated threshold achieves high authorship attribution accuracy, especially with character trigrams and ending bigrams, where the clustering recognition rate (CRR) reaches 100% at noise levels from 0% to 7%. In addition, the proposed 3D sphere technique shows high clustering performance, mainly with words.

Keywords: 3D Sphere Visualisation, Artificial Intelligence, Authorship Attribution, Clustering, GMM, Noisy Text, Visual Analytics
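As a rough illustration of the features named in the abstract, the following Python sketch extracts character trigrams and ending bigrams from a text, corrupts the text at a chosen noise level, and compares two feature profiles. Everything here is an assumption made for illustration: the function names are hypothetical, an "ending bigram" is taken to mean the last two characters of a word, noise is modelled as random character substitution, and cosine distance stands in for whatever distance the authors' clustering actually uses.

```python
import random
from collections import Counter


def char_trigrams(text):
    """Count overlapping 3-character sequences in the text."""
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def ending_bigrams(text):
    """Count the last two characters of each word (assumed definition)."""
    return Counter(w[-2:] for w in text.split() if len(w) >= 2)


def add_noise(text, level, seed=0):
    """Replace about `level` (0.0 to 1.0) of the characters with random
    lowercase letters. This noise model is an assumption, not the paper's."""
    rng = random.Random(seed)
    letters = "abcdefghijklmnopqrstuvwxyz"
    return "".join(
        rng.choice(letters) if rng.random() < level else c for c in text
    )


def cosine_distance(a, b):
    """Cosine distance between two sparse count profiles (Counters)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = sum(v * v for v in a.values()) ** 0.5
    norm_b = sum(v * v for v in b.values()) ** 0.5
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


if __name__ == "__main__":
    clean = "the mind is its own place and in itself can make a heaven of hell"
    noisy = add_noise(clean, 0.07)  # 7% noise, the upper level quoted above
    print(cosine_distance(char_trigrams(clean), char_trigrams(noisy)))
    print(cosine_distance(ending_bigrams(clean), ending_bigrams(noisy)))
```

A pairwise distance matrix built from such profiles could then be passed to a hierarchical clustering routine, with documents merged until the linkage distance exceeds a threshold.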
