Authorship Attribution and Optical Character Recognition Errors

Stylometric authorship attribution is a fundamental problem. The underlying idea is that the author of a document can be identified from cognitive and linguistic quirks that are unique to that person's writing. In many cases, however, noise in the original documents makes this analysis more difficult and less reliable. We investigate the errors introduced by a typical optical character recognition (OCR) process. Using simulated (random) errors in a standard benchmark corpus, we test how sensitive the authorship attribution process is to character misrecognition. Our results indicate that, while accuracy decreases measurably with noise, the decrease is not substantial.
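The idea of simulating random OCR errors can be illustrated with a minimal sketch. The abstract does not specify the noise model, so everything here is an assumption made for illustration: the function names `add_ocr_noise` and `char_error_rate`, the choice of uniform character substitution (no insertions or deletions), the lowercase replacement alphabet, and the decision to leave whitespace untouched.

```python
import random
import string


def add_ocr_noise(text, error_rate, rng=None, alphabet=string.ascii_lowercase):
    """Return a copy of `text` with simulated character misrecognition.

    Assumed noise model (not taken from the paper): each non-whitespace
    character is independently replaced, with probability `error_rate`,
    by a different character drawn uniformly from `alphabet`.
    """
    if not 0.0 <= error_rate <= 1.0:
        raise ValueError("error_rate must be between 0 and 1")
    if rng is None:
        rng = random.Random()
    noisy = []
    for ch in text:
        if not ch.isspace() and rng.random() < error_rate:
            # Exclude the original character so every substitution is a real error.
            candidates = [c for c in alphabet if c != ch]
            noisy.append(rng.choice(candidates))
        else:
            noisy.append(ch)
    return "".join(noisy)


def char_error_rate(original, noisy):
    """Fraction of positions at which two equal-length strings differ."""
    if len(original) != len(noisy):
        raise ValueError("strings must have the same length")
    if not original:
        return 0.0
    return sum(a != b for a, b in zip(original, noisy)) / len(original)


if __name__ == "__main__":
    sample = "the quick brown fox jumps over the lazy dog"
    rng = random.Random(0)  # fixed seed so a run can be repeated
    for rate in (0.0, 0.05, 0.20):
        corrupted = add_ocr_noise(sample, rate, rng=rng)
        print(rate, round(char_error_rate(sample, corrupted), 3), corrupted)
```

In an experiment of the kind described, each corpus document would be corrupted at several error rates and then passed to whatever attribution method is under test, with accuracy compared against the uncorrupted baseline. Real OCR errors tend to be systematic (visually similar characters are confused more often), so a uniform model like this one is only a rough stand-in.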
