Authorship Attribution Using Ensembles of Classifiers

The authorship attribution problem can be viewed as a categorization problem. To discriminate between different writers (or categories), we must first select a list of useful features (word types in this study) and then train a classifier. To improve effectiveness, we can consider an ensemble of models instead of a single classifier (bagging). In the current study, we propose two forms of variation: varying the author profiles on the one hand, and varying the list of selected features on the other. To compare the effectiveness of these solutions, we extracted two corpora of Glasgow Herald articles written by five columnists: the first covers sports (1,948 articles) and the second politics (987 articles). Using the Kullback-Leibler divergence (KLD) model (Zhao & Zobel, 2007), we found that a simple classification scheme tends to produce results comparable to those obtained with more complex ones.

KEYWORDS: ensemble methods, authorship attribution, text categorization.
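As a rough illustration of the ideas summarized above, the following Python sketch attributes a query text to the author whose word-type profile has the smallest Kullback-Leibler divergence from it, and then builds an ensemble by varying the list of selected features. It is a minimal sketch under stated assumptions, not the implementation used in the study: the function names, the additive smoothing constant, the random choice of feature subsets, and the majority vote are all hypothetical choices made for illustration.

```python
import math
import random
from collections import Counter


def profile(tokens, features, smoothing=0.1):
    """Smoothed relative frequencies of the selected word types.

    The additive smoothing constant is an assumption; it only keeps
    every probability strictly positive so the logarithm is defined.
    """
    counts = Counter(t for t in tokens if t in features)
    total = sum(counts.values()) + smoothing * len(features)
    return {w: (counts[w] + smoothing) / total for w in features}


def kld(p, q):
    """Kullback-Leibler divergence KLD(p || q) over a shared feature list."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)


def attribute(query_tokens, author_tokens, features):
    """Single model: pick the author profile closest to the query text."""
    q = profile(query_tokens, features)
    return min(
        author_tokens,
        key=lambda a: kld(q, profile(author_tokens[a], features)),
    )


def ensemble_attribute(query_tokens, author_tokens, features,
                       n_models=25, frac=0.5, seed=0):
    """Ensemble: vote over models built on random feature subsets.

    Random subsets and majority voting are illustrative assumptions.
    """
    rng = random.Random(seed)
    feats = sorted(features)
    k = max(1, int(frac * len(feats)))
    votes = Counter(
        attribute(query_tokens, author_tokens, set(rng.sample(feats, k)))
        for _ in range(n_models)
    )
    return votes.most_common(1)[0][0]


# Toy data, invented purely for illustration.
authors = {
    "A": "the the the of of and the the".split(),
    "B": "and and and of to to and and".split(),
}
features = {"the", "of", "and", "to"}
query = "the the of the and".split()

print(attribute(query, authors, features))
print(ensemble_attribute(query, authors, features))
```

Varying the author profiles instead of the feature list could be sketched the same way, for example by rebuilding each profile from a resampled subset of that author's texts before voting.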

[1] Peter A. Flach, et al. Machine Learning - The Art and Science of Algorithms that Make Sense of Data, 2012.

[2] Hsinchun Chen, et al. A framework for authorship identification of online messages: Writing-style features and classification techniques, 2006.

[3] Fred J. Damerau, et al. The use of function word frequencies as indicators of style, 1975.

[4] Austin F. Frank, et al. Analyzing linguistic data: a practical introduction to statistics using R, 2010.

[5] Carol Peters, et al. Comparative Evaluation of Multilingual Information Access Systems, 2003, Lecture Notes in Computer Science.

[6] Trevor Hastie, et al. The Elements of Statistical Learning, 2001.

[7] Justin Zobel, et al. Searching With Style: Authorship Attribution in Classic Literature, 2007, ACSC.

[8] Ian Witten, et al. Data Mining, 2000.

[9] Patrick Juola, et al. Authorship Attribution, 2008, Found. Trends Inf. Retr.

[10] Leo Breiman, et al. Bagging Predictors, 1996, Machine Learning.

[11] D. Holmes. The Evolution of Stylometry in Humanities Scholarship, 1998.

[12] M. F. Fuller, et al. Practical Nonparametric Statistics; Nonparametric Statistical Inference, 1973.

[13] John Burrows, et al. 'Delta': a Measure of Stylistic Difference and a Guide to Likely Authorship, 2002, Lit. Linguistic Comput.

[14] Denyse Baillargeon, et al. Bibliographie, 1929.

[15] David J. Hand, et al. Classifier Technology and the Illusion of Progress, 2006, math/0606441.

[16] W. J. Conover, et al. Practical Nonparametric Statistics, 1972.

[17] Frederick Mosteller, et al. Applied Bayesian and classical inference: the case of the Federalist papers, 1984.

[18] Justin Zobel, et al. Effective and Scalable Authorship Attribution Using Function Words, 2005, AIRS.

[19] Elisabeth Dévière, et al. Analyzing linguistic data: a practical introduction to statistics using R, 2009.

[20] Joseph Rudman, et al. The Twelve Disputed 'Federalist' Papers: A Case for Collaboration, 2012, DH.

[21] David L. Hoover, et al. Testing Burrows's Delta, 2004, Lit. Linguistic Comput.

[22] Mark Steyvers, et al. Detecting authorship deception: a supervised machine learning approach using author writeprints, 2012, Lit. Linguistic Comput.

[23] Efstathios Stamatatos, et al. A survey of modern authorship attribution methods, 2009, J. Assoc. Inf. Sci. Technol.

[24] Stephen R. Marsland, et al. Machine Learning - An Algorithmic Perspective, 2009, Chapman and Hall / CRC machine learning and pattern recognition series.

[25] J. A. Smith, et al. Stylistic Constancy and Change Across Literary Corpora: Using Measures of Lexical Richness to Date Works, 2002, Comput. Humanit.

[26] Fabrizio Sebastiani, et al. Machine learning in automated text categorization, 2001, CSUR.

[27] Yiming Yang, et al. A re-examination of text categorization methods, 1999, SIGIR '99.

[28] อนิรุธ สืบสิงห์, et al. Data Mining Practical Machine Learning Tools and Techniques, 2014.

[29] Thomas G. Dietterich. An Experimental Comparison of Three Methods for Constructing Ensembles of Decision Trees: Bagging, Boosting, and Randomization, 2000, Machine Learning.

[30] Bernard Ycart, et al. Alberti's letter counts, 2012, Lit. Linguistic Comput.

[31] H. Love. Attributing Authorship: An Introduction, 2002.

[32] D. Ruppert. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2004.

[33] D. Hoover. Corpus Stylistics, Stylometry, and the Styles of Henry James, 2007.

[34] J. M. Hughes, et al. Quantitative patterns of stylistic influence in the evolution of literature, 2012, Proceedings of the National Academy of Sciences.

[35] Tin Kam Ho, et al. The Random Subspace Method for Constructing Decision Forests, 1998, IEEE Trans. Pattern Anal. Mach. Intell.

[36] H. Sichel. On a Distribution Law for Word Frequencies, 1975.

[37] Patrick Juola, et al. The Time Course of Language Change, 2003, Comput. Humanit.

[38] Jacques Savoy, et al. Etude comparative de stratégies de sélection de prédicteurs pour l'attribution d'auteur, 2012, CORIA.

[39] Robert Tibshirani, et al. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd Edition, 2001, Springer Series in Statistics.

[40] Jacques Savoy, et al. Authorship attribution based on a probabilistic topic model, 2013, Inf. Process. Manag.