Document analysis at DFKI. - Part 2: Information extraction

Document analysis has enabled essential progress in office automation. This paper is part of an overview of the combined research efforts in document analysis at DFKI. Common to all document analysis projects is the global goal of providing a high-level electronic representation of documents in terms of iconic, structural, textual, and semantic information. These symbolic document descriptions enable “intelligent” access to a document database. Currently there are three ongoing document analysis projects at DFKI: INCA, OMEGA, and PASCAL2000/PASCAL+. Although the projects pursue different goals in different application domains, they share the same problems, which have to be solved with similar techniques. For that reason the activities in these projects are bundled to avoid redundant work. At DFKI we have divided the problem of document analysis into two main tasks, text recognition and information extraction, each of which is divided into a set of subtasks. The work of the document analysis and office automation department at DFKI is presented in a series of three research reports. The first report discusses the problem of text recognition, the second that of information extraction. The third report describes our concept for a specialized knowledge representation language for document analysis. The present report describes the activities dealing with the information extraction task. Information extraction covers three phases: text analysis, message type identification, and file integration.
