Paper Information - Interactive Machine Learning (IML) Markup of OCR Generated Text by Exploiting Domain Knowledge: A Biodiversity Case Study


Several digitization projects, such as Google Books, are scanning millions of books. The Biodiversity Heritage Digital Library (BHL, http://www.bhl.si.edu/) plans to scan 1 million volumes of biodiversity literature over the next five years. However, the usefulness of the scanned images is limited because they can only be accessed through existing catalog information. The images cannot easily be manipulated or transformed into useful information for full-text information systems. “Because of the very large amounts of data being generated, it is difficult to have human curators extract all these information and present them in a form useful to researchers. Information Extraction (IE) from such sources is becoming crucial for the timely dissemination of information.” (Subramaniam, 2003). Consequently, simple approaches that transform the text into a structured format, such as XML or a relational database, will not be successful.

Qin Wei | P. Bryan Heidorn

[1] Sougata Mukherjea, et al. Information extraction from biomedical literature: methodology, evaluation and an application, 2003, CIKM '03.

[2] David Robins, et al. Interactive Information Retrieval: Context and Basic Notions, 2000, Informing Sci. Int. J. an Emerg. Transdiscipl.

[3] Linda C. Smith, et al. Automating semantic markup of semi-structured text via an induced knowledge base: a case study using floras, 2005.

[4] Sunita Sarawagi, et al. Automatic segmentation of text into structured records, 2001, SIGMOD '01.

[5] James R. Curran, et al. Blueprint for a High Performance NLP Infrastructure, 2003, HLT-NAACL 2003.

[6] Nicholas J. Belkin, et al. A case for interaction: a study of interactive information retrieval behavior and effectiveness, 1996, CHI.

[7] Ian H. Witten, et al. Interactive machine learning: letting users build classifiers, 2002, Int. J. Hum. Comput. Stud.

[8] Pabitra Mitra, et al. Extracting semantic structure of web documents using content and visual information, 2005, WWW '05.