Information extraction for semi-structured documents

The number of unstructured or semi-structured documents produced in all types of organizations continues to increase rapidly. Cost-effective ways of finding the relevant ones and extracting useful information from them are increasingly important to a large number of enterprises for operational and decision-support applications. The approach discussed in this paper constitutes a suitable basis for building an effective solution to extracting information from semi-structured documents for two principal reasons. First, it provides an extensible architecture basis for: extracting structured information from semistructured documents; providing fast and accurate selective access to this information; performing selective dissemination of relevant documents depending on filtering criteria. Second, it is simple in terms of: the complexity of the algorithms used for structure recognition and document filtering; the number and size of data structures required to perform the three functions mentioned above; the amount and complexity of the metadata required to handle a given collection of documents. The work described here is part of the Dyade Médiation project, which aims to provide integrated software components for accessing heterogeneous data sources in Internet/Intranet environments.

[1]  Ralph Grishman,et al.  Design of the MUC-6 evaluation , 1995, MUC.

[2]  Joann J. Ordille,et al.  Querying Heterogeneous Information Sources Using Source Descriptions , 1996, VLDB.

[3]  Ellen Riloff,et al.  Automatically Acquiring Conceptual Patterns without an Annotated Corpus , 1995, VLC@ACL.

[4]  Jennifer Widom,et al.  The Lorel query language for semistructured data , 1997, International Journal on Digital Libraries.

[5]  Ellen Riloff,et al.  Information extraction as a basis for high-precision text classification , 1994, TOIS.

[6]  James P. Callan,et al.  Training algorithms for linear text classifiers , 1996, SIGIR '96.

[7]  Serge Abiteboul,et al.  Querying Semi-Structured Data , 1997, Encyclopedia of Database Systems.

[8]  James Allan,et al.  Approaches to passage retrieval in full text information systems , 1993, SIGIR.

[9]  Dan Suciu,et al.  A query language and optimization techniques for unstructured data , 1996, SIGMOD '96.

[10]  Patrick Valduriez,et al.  Scaling heterogeneous databases and the design of Disco , 1996, Proceedings of 16th International Conference on Distributed Computing Systems.

[11]  Jennifer Widom,et al.  The TSIMMIS Project: Integration of Heterogeneous Information Sources , 1994, IPSJ.

[12]  Gerard Salton,et al.  Automatic Text Theme Generation and the Analysis of Text Structure , 1994 .

[13]  David Fisher,et al.  CRYSTAL: Inducing a Conceptual Dictionary , 1995, IJCAI.

[14]  Alberto O. Mendelzon,et al.  Querying the World Wide Web , 1997, International Journal on Digital Libraries.