Clustering-Based Article Identification in Historical Newspapers

This article addresses the problem of identifying articles and recovering their text within and across newspaper pages when OCR delivers only one text file per page. We frame the task as a segmentation step followed by a clustering step. Our results on a sample of the 1912 New York Tribune magazine show that clustering based on similarities computed with word embeddings outperforms clustering based on a similarity measure over character n-grams and words. Furthermore, automatic segmentation based on the text yields low scores because of the low OCR quality of some documents.
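To make the clustering step concrete, the following is a minimal sketch of grouping already-segmented text blocks by textual similarity. It illustrates only the character n-gram baseline, using cosine similarity over trigram counts and a simple single-link threshold merge. The function names, the trigram size, the threshold value, and the single-link strategy are all assumptions chosen for illustration, not the configuration used in the study. An embedding-based variant would replace `char_ngrams` with a function that maps each segment to a dense vector, for example an average of pretrained word vectors.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n=3):
    """Count overlapping character n-grams in lowercased, whitespace-normalized text."""
    text = " ".join(text.lower().split())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def cluster_segments(segments, threshold=0.3):
    """Single-link clustering: merge segments whose similarity reaches the threshold.

    The threshold is a hypothetical value and would need tuning on real OCR output.
    Returns one integer cluster label per segment, numbered in order of appearance.
    """
    vectors = [char_ngrams(s) for s in segments]
    parent = list(range(len(segments)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(segments)):
        for j in range(i + 1, len(segments)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                parent[find(j)] = find(i)

    labels = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(len(segments))]
```

Under these assumptions, each returned label identifies one candidate article, so segments that share a label would be concatenated to recover that article's text. Single-link merging is used here only because it needs no external libraries; it is not claimed to match the clustering algorithm evaluated in the article.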
