Clustering: An Application with the State of the Union Addresses

This paper describes a clustering and authorship attribution study over the State of the Union addresses from 1790 to 2014 (224 speeches delivered by 41 presidents). To define the style of each presidency, we apply a principal component analysis (PCA) to the part-of-speech (POS) frequencies. From Roosevelt (1934) onward, each president tends to have a distinctive style, whereas earlier presidents usually share some stylistic aspects with others. Applying an automatic classification based on the frequencies of all content-bearing word-types, we show that chronology tends to play a central role in forming clusters, and that it is a more important factor than political affiliation. Using the 300 most frequent word-types, we generate a second clustering representation based on the style of each president. This second view shares similarities with the first, but usually contains more numerous and smaller clusters. Finally, an authorship attribution approach applied to each speech reaches a success rate of around 95.7% under some constraints. When an assignment is incorrect, the proposed author often belongs to the same party and lived during roughly the same period as the presumed author. A closer analysis of some incorrect assignments reveals why these attributions are difficult.
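As an illustration of attribution from the most frequent word-types, the following is a minimal sketch and not the method used in the paper. The abstract does not name the distance measure or the classifier, so this sketch assumes a Burrows-style Delta (mean absolute difference of z-scored relative frequencies) with a nearest-profile rule. The function names (`tokenize`, `top_word_types`, `rel_freqs`, `attribute`) and the toy texts are hypothetical.

```python
import re
from collections import Counter
from statistics import mean, pstdev


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def top_word_types(texts, k):
    """Return the k most frequent word-types over all texts."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return [word for word, _ in counts.most_common(k)]


def rel_freqs(text, vocab):
    """Relative frequency of each vocabulary word in one text."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    total = len(tokens) or 1
    return [counts[word] / total for word in vocab]


def attribute(query, profiles, k=300):
    """Return the author whose profile is closest to the query text.

    profiles maps an author name to the concatenated text of that author.
    Assumption: distance is a Burrows-style Delta, which the abstract
    does not confirm.
    """
    vocab = top_word_types(list(profiles.values()), k)
    freqs = {author: rel_freqs(text, vocab) for author, text in profiles.items()}
    query_freqs = rel_freqs(query, vocab)

    # Mean and standard deviation of each word over the author profiles.
    columns = [[f[i] for f in freqs.values()] for i in range(len(vocab))]
    mus = [mean(col) for col in columns]
    sds = [pstdev(col) or 1.0 for col in columns]  # avoid division by zero

    def zscores(vector):
        return [(x - mu) / sd for x, mu, sd in zip(vector, mus, sds)]

    query_z = zscores(query_freqs)

    def delta(author):
        author_z = zscores(freqs[author])
        return mean(abs(a - b) for a, b in zip(query_z, author_z))

    return min(profiles, key=delta)


# Toy data, invented for illustration only.
profiles = {
    "A": "the the the we shall govern the the nation the",
    "B": "i i i think i will i do i",
}
guess = attribute("the nation the people the the", profiles)
```

With real speeches, `profiles` would hold the texts of each president excluding the speech being tested, and `k=300` mirrors the 300 most frequent word-types mentioned above.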
