Author Classification using Generalized Discriminant Analysis

Classification of documents by authorship using statistical analysis (stylometry) is considered here, using feature vectors obtained from counts of all words in the intersecting sets of the training data. This differs from some previous stylometric studies, which used only selected “noncontextual” words with the highest counts, and also from conventional text search techniques, where noncontextual words are frequently left out when the term-by-document matrices are formed. The dimensionality of the resulting feature vectors is reduced using a generalized discriminant analysis (GDA). The method is tested on three sets of documents that have previously been subjected to statistical analysis. Results show that the method succeeds at identifying differences between authors and at classifying documents of unknown authorship, with outcomes consistent with those of previous techniques.
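As a rough illustration of the feature construction described above, the following minimal sketch builds word-count vectors over a shared vocabulary. It assumes that "intersecting sets" means the words occurring in every training document; the tokenizer, the function names, and the toy sentences are hypothetical and not taken from the paper. The GDA dimension-reduction step is not shown.

```python
import re
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase runs of letters and apostrophes.
    return re.findall(r"[a-z']+", text.lower())


def shared_vocabulary(documents):
    # Assumption: the "intersecting set" is the set of words that
    # appear in every training document.
    vocab_sets = [set(tokenize(doc)) for doc in documents]
    return sorted(set.intersection(*vocab_sets))


def count_vector(text, vocabulary):
    # Count every shared word, including common function words.
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]


# Toy training documents, invented for illustration only.
training = [
    "the people of the state and the union",
    "to the people: the union of states",
    "of the union and of the people",
]

vocab = shared_vocabulary(training)
vectors = [count_vector(doc, vocab) for doc in training]

print(vocab)
for vector in vectors:
    print(vector)
```

Each row of `vectors` is one document's feature vector. In a real corpus the shared vocabulary would be far larger, which is why a dimension-reduction step such as GDA follows.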
