COMPARA - Optical Character Recognition (OCR) editing and preliminary markup instructions

For a text to become searchable in an electronic corpus, it has to be in digital form. If the text is not available in digital form, it has to be either typed into a computer or scanned and submitted to an OCR program. The scanner works like a photocopier, i.e., it captures an image of each page. The OCR program then converts that image into text that you can edit in a word processor: you can add or delete words, change fonts, etc. The first stage of preparing texts for COMPARA, described here, involves (1) correcting OCR errors, (2) removing parts of the text that are not needed, and (3) introducing a number of necessary tags. All three tasks can be done at the same time, as you go over the digital text on your computer screen with the printed text close at hand.