Sequence Models for Automatic Highlighting and Surface Information Extraction

With the growth of textual information available electronically, we are witnessing a great diversification of the demands placed on Information Retrieval (IR) and Information Extraction (IE) systems. In this paper we apply Machine Learning techniques for sequence analysis to the tasks of highlighting and labeling text with respect to an information extraction task. Specifically, dynamic probability models are used. Like IR systems, they use little semantics, are fully trainable, and do not require any knowledge representation of the domain. Unlike IR approaches, they treat a document as a dynamic sequence of words. Furthermore, additional word information is naturally included in the representation. The models are evaluated on a sub-task of the MUC6 Scenario Template corpus. When morpho-syntactic word information is introduced into the representation, an increase in performance is observed.
