A major focus of recent large-scale digitisation initiatives has been historical texts, primarily in the form of out-of-copyright newspapers and books. However, the Optical Character Recognition (OCR) software used to convert the scanned images into machine-readable text does not provide satisfactory results for historical documents. This is due to issues inherent in the material, such as warped pages, bleed-through, historical fonts, broken and irregular characters, complex layouts, and spelling variants.
In the large-scale project Improving Access to Text (IMPACT), a European team of scientists, industry partners and digitisation professionals has been working together to enhance existing approaches, and to develop new ones, for extracting text content from historical documents. The project facilitates a successful collaboration between digitisation professionals, based at institutions digitising millions of historical text documents, and scientists in document analysis, language technologies and OCR.
This session will detail the work of IMPACT in the context of real-life problems faced in the large-scale digitisation programmes of libraries. It will also describe the legacy that the project will leave to foster further research in advancing the state of the art in extracting textual content from historical documents.