Paper Information - Automated Processing of Digitized Historical Newspapers beyond the Article Level: Sections and Regular Features

Millions of pages of historical newspapers have been digitized, but in most cases access to them is supported only by basic search services. We are exploring interactive services that would support access to these collections, including automatic categorization of articles. Such categorization is difficult because of the uneven quality of the OCR text, but many clues can help improve its accuracy. Here, we describe observations of several historical newspapers to determine the characteristics of sections. We then explore how to automatically identify those sections and how to detect serialized feature articles that are repeated across days and weeks. The goal is not to introduce new algorithms but to develop practical and robust techniques. For both analyses, we find substantial success for some categories and articles, but others prove very difficult.

Robert B. Allen | Catherine Hall