Quootstrap: Scalable Unsupervised Extraction of Quotation-Speaker Pairs from Large News Corpora via Bootstrapping

We propose Quootstrap, a method for extracting quotations, as well as the names of the speakers who uttered them, from large news corpora. Whereas prior work has addressed this problem primarily with supervised machine learning, our approach follows a fully unsupervised bootstrapping paradigm. It leverages the redundancy present in large news corpora: the same quotation often appears across multiple news articles in slightly different contexts. Starting from a few seed patterns, such as ["Q", said S.], our method extracts a set of quotation-speaker pairs (Q, S), which are in turn used to discover new patterns expressing the same quotations; the process is then repeated with the larger pattern set. Our algorithm is highly scalable, which we demonstrate by running it on the large ICWSM 2011 Spinn3r corpus. Validating our results against a crowdsourced ground truth, we obtain 90% precision at 40% recall using a single seed pattern, with significantly higher recall for more frequently reported (and thus likely more interesting) quotations. Finally, we showcase the usefulness of our algorithm's output for computational social science by analyzing the sentiment expressed in our extracted quotations.
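
The bootstrapping loop described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy corpus, the `$Q`/`$S` placeholder syntax, the speaker regex, and the function names (`compile_pattern`, `extract_pairs`, `induce_patterns`, `bootstrap`) are all assumptions made for illustration, and the sketch omits any filtering of noisy patterns, which a system running on a real corpus would presumably need.

```python
import re

# Toy corpus (hypothetical): the same quotation appears in two contexts.
CORPUS = [
    '"We will win", said Alice Smith.',
    'Alice Smith told reporters: "We will win"',
    'Bob Jones told reporters: "Taxes are too high"',
]

def compile_pattern(pattern):
    """Turn a pattern such as '"$Q", said $S.' into a compiled regex."""
    parts = re.split(r"(\$Q|\$S)", pattern)
    out = []
    for part in parts:
        if part == "$Q":
            out.append(r'(?P<q>[^"]+)')  # quotation: any text without quotes
        elif part == "$S":
            # Speaker: two or more capitalized words (a crude assumption).
            out.append(r"(?P<s>[A-Z][a-z]+(?: [A-Z][a-z]+)+)")
        else:
            out.append(re.escape(part))  # literal context around the slots
    return re.compile("^" + "".join(out) + "$")

def extract_pairs(patterns, corpus):
    """Apply every known pattern to every sentence; collect (Q, S) pairs."""
    pairs = set()
    for pattern in patterns:
        regex = compile_pattern(pattern)
        for sentence in corpus:
            match = regex.match(sentence)
            if match:
                pairs.add((match.group("q"), match.group("s")))
    return pairs

def induce_patterns(pairs, corpus):
    """Find sentences containing a known pair and abstract them to patterns."""
    patterns = set()
    for quotation, speaker in pairs:
        for sentence in corpus:
            if quotation in sentence and speaker in sentence:
                pattern = sentence.replace(quotation, "$Q").replace(speaker, "$S")
                patterns.add(pattern)
    return patterns

def bootstrap(seed_patterns, corpus, max_iterations=5):
    """Alternate pair extraction and pattern discovery until nothing new appears."""
    patterns = set(seed_patterns)
    pairs = set()
    for _ in range(max_iterations):
        pairs |= extract_pairs(patterns, corpus)
        new_patterns = induce_patterns(pairs, corpus) - patterns
        if not new_patterns:
            break
        patterns |= new_patterns
    return pairs, patterns

if __name__ == "__main__":
    found_pairs, found_patterns = bootstrap(['"$Q", said $S.'], CORPUS)
    for pair in sorted(found_pairs):
        print(pair)
    for pattern in sorted(found_patterns):
        print(pattern)
```

On this toy corpus, the single seed pattern matches only the first sentence. The pair it yields reappears in the second sentence, which gives a new pattern, and that pattern in turn attributes the third quotation to its speaker. This is the redundancy effect the abstract refers to, reduced to three sentences.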
