Autotag: A Tool for Creating Structured Document Collections from Printed Materials

We report on the design and implementation of a system that automates the capture of structured documents from the optically recognized form of printed materials. The system is intended to convert printed collections into their corresponding tagged electronic versions with little or no manual intervention. This conversion process poses some unique problems, which we discuss along with our attempts to solve them. The system also establishes a mapping between the bitmap image and its corresponding ASCII representation; this mapping can be used to design flexible image-based interfaces for IR-related applications.
