Developing an Image-Based Classifier for Detecting Poetic Content in Historic Newspaper Collections

The Image Analysis for Archival Discovery (Aida) project team is investigating the use of image analysis to identify poetic content in historic newspapers. The project has two aims. The first is to augment the study of literary history by drawing attention to the magnitude of poetry published in newspapers and by making that poetry more readily available for study. The second is to advance work on the use of digital images to facilitate discovery in digital libraries and other digitized collections. We have recently finished training our classifier for identifying poetic content, and as we prepare to move to the deployment stage, we are making our methods for classification and testing available in order to promote further research and discussion. The precision and recall values achieved during the training stage (90.58% and 79.4%, respectively) and the testing stage (74.92% and 61.84%, respectively) are encouraging. In addition to explaining why such an approach is needed and situating our project alongside related work, this paper analyzes preliminary results, which support the feasibility of our approach to detecting poetic content in historic newspaper collections.
