Building a test collection for complex document information processing

Research and development of information access technology for scanned paper documents has been hampered by the lack of public test collections of realistic scope and complexity. As part of a project to create a prototype system for searching and mining masses of document images, we are assembling a 1.5 terabyte dataset to support evaluation of both end-to-end complex document information processing (CDIP) tasks (e.g., text retrieval and data mining) and component technologies such as optical character recognition (OCR), document structure analysis, signature matching, and authorship attribution.
