Interactive Machine Learning (IML) Markup of OCR Generated Text by Exploiting Domain Knowledge: A Biodiversity Case Study

Several digitization projects, such as Google Books, are scanning millions of books. The Biodiversity Heritage Digital Library (BHL, http://www.bhl.si.edu/) plans to scan 1 million volumes of biodiversity literature over the next five years. However, the usefulness of the scanned images is limited because they can only be accessed through existing catalog information. Images cannot be easily manipulated and transformed into useful information in full-text information systems. “Because of the very large amounts of data being generated, it is difficult to have human curators extract all these information and present them in a form useful to researchers. Information Extraction (IE) from such sources is becoming crucial for the timely dissemination of information” (Subramaniam, 2003). Consequently, simple approaches that transform the text into a structured format, such as XML or a relational database, will not be successful.