Paper information - Information extraction for semi-structured documents

The number of unstructured or semi-structured documents produced in all types of organizations continues to increase rapidly. Cost-effective ways of finding the relevant documents and extracting useful information from them are increasingly important to many enterprises for operational and decision-support applications. The approach discussed in this paper is a suitable basis for building an effective solution for extracting information from semi-structured documents, for two principal reasons. First, it provides an extensible architectural basis for extracting structured information from semi-structured documents, providing fast and accurate selective access to this information, and performing selective dissemination of relevant documents according to filtering criteria. Second, it is simple in terms of the complexity of the algorithms used for structure recognition and document filtering, the number and size of the data structures required to perform the three functions mentioned above, and the amount and complexity of the metadata required to handle a given collection of documents. The work described here is part of the Dyade Médiation project, which aims to provide integrated software components for accessing heterogeneous data sources in Internet/Intranet environments.

Dan Smith | M Lopez
