Pattern Matching and Discourse Processing in Information Extraction from Japanese Text

Information extraction is the task of automatically extracting information of interest from unconstrained text. The information is usually extracted in two steps. First, sentence-level processing locates relevant pieces of information scattered throughout the text; second, discourse processing merges coreferential information to generate the output. In the first step, pieces of information are identified locally, without recognizing any relationships among them; a keyword search or simple pattern search is sufficient for this purpose. The second step requires deeper knowledge in order to understand the relationships among separately identified pieces of information. Previous information extraction systems focused on the first step, partly because they were not required to link each piece of information with the others. Linking the extracted pieces of information and mapping them onto a structured output format, however, requires complex discourse processing. This paper reports on a Japanese information extraction system that merges information using a pattern matcher and a discourse processor. Evaluation results show a high level of system performance that approaches human performance.
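The two-step organization described above can be illustrated with a minimal sketch. It is not the system reported in the paper: the English stand-in sentences, the company names, the `MENTION` pattern, and the two merging heuristics (short-name prefix matching and resolving "the company" to the most recently mentioned company) are all assumptions chosen only to show how local pattern search and discourse-level merging differ.

```python
import re

# Toy English stand-in for a news article (hypothetical data, not from the paper).
SENTENCES = [
    "Acme Corp. announced a joint venture with Zenith Corp.",
    "Acme will hold 60 percent of the shares.",
    "The company plans to start production next spring.",
]

# Step 1: sentence-level search. A simple pattern ("X Corp."), a keyword list
# of short names, and a definite description are all matched locally.
MENTION = re.compile(r"[A-Z][a-z]+ Corp\.|\b(?:Acme|Zenith)\b|\b[Tt]he company\b")


def locate_mentions(sentences):
    """Return (sentence_index, surface_string) pairs; no relationships are recognized."""
    return [(i, m.group(0)) for i, s in enumerate(sentences) for m in MENTION.finditer(s)]


def merge_mentions(mentions):
    """Step 2: merge coreferential mentions into one entry per company.

    Assumed heuristics: a short name merges with an already-seen full name that
    starts with it, and "the company" refers to the most recently mentioned company.
    """
    entities = {}  # canonical name -> sentence indices where it is mentioned
    last = None
    for idx, surface in mentions:
        if surface.lower() == "the company":
            name = last
        else:
            short = surface.replace(" Corp.", "")
            name = next((n for n in entities if n.startswith(short)), surface)
        if name is None:
            continue  # definite description with no antecedent; left unresolved
        entities.setdefault(name, []).append(idx)
        last = name
    return entities


if __name__ == "__main__":
    print(merge_mentions(locate_mentions(SENTENCES)))
```

In this sketch the first step yields four unrelated mentions, and only the second step establishes that "Acme Corp.", "Acme", and "The company" denote the same entity, which is the kind of linking a structured output format requires.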
