Using Weakly Structured Documents to Fill in a Classical Database

Electronic documents have become a universal means of communication with the expansion of the Web. Yet structured information stored in databases remains essential for managing data coherence, for querying facilities, and so on. We thus face a classical problem, known in the database world as "impedance mismatch": two antagonistic approaches have to collaborate. Using documents at the end-user interface level provides simplicity and flexibility. But documents can serve as data sources only with human help, because automatic document analysis systems have a significant error rate. Databases are an alternative in which the semantics and format of information are strict: SQL queries provide 100% correct responses. The aim of this work is to provide a system that combines the freedom of document capture with the structure of database storage. The system we propose is not intended to be universal. It can be used in specific settings where people routinely work with technical documents dedicated to a particular domain. Our examples come from medicine, and more specifically from medical records. Physicians have very often rejected computerization because it requires too much standardization and because form-based user interfaces are not suited to their daily practice. We think that this study provides a viable alternative approach in this domain. The system gives doctors freedom: they fill in documents with the information they want to store, in whatever order is convenient and in a freer style. We have developed a system that allows users to fill in a database almost automatically from document paragraphs. The database is an existing one that can be queried in the classical way for statistical studies or epidemiological purposes. In this system, the document collection and the database containing the extractions from documents coexist. Queries are sent to the database, and answers include both data from the database and references to the source documents.
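
To illustrate the last point, the following is a minimal sketch of how an answer can carry both extracted data and a reference to its source document. It is not the authors' implementation: the table names (document, diagnosis), the column names, the helper functions, and the sample record are all hypothetical, and SQLite stands in for whatever database the real system uses.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with hypothetical tables (assumption:
    the real schema is not given in the abstract)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE document (
            doc_id     INTEGER PRIMARY KEY,
            patient_id INTEGER NOT NULL,
            paragraph  TEXT NOT NULL
        );
        CREATE TABLE diagnosis (
            patient_id INTEGER NOT NULL,
            code       TEXT NOT NULL,
            source_doc INTEGER NOT NULL REFERENCES document(doc_id)
        );
        """
    )
    # Invented sample data: one free-text paragraph and one value
    # extracted from it, linked back through source_doc.
    conn.execute(
        "INSERT INTO document VALUES (?, ?, ?)",
        (1, 42, "Patient reports chest pain; suspected angina."),
    )
    conn.execute("INSERT INTO diagnosis VALUES (?, ?, ?)", (42, "I20", 1))
    conn.commit()
    return conn


def query_with_sources(conn, code):
    """Return extracted rows together with the paragraph they came from."""
    return conn.execute(
        "SELECT d.patient_id, d.code, d.source_doc, doc.paragraph "
        "FROM diagnosis AS d "
        "JOIN document AS doc ON doc.doc_id = d.source_doc "
        "WHERE d.code = ? "
        "ORDER BY d.patient_id",
        (code,),
    ).fetchall()


conn = build_demo_db()
results = query_with_sources(conn, "I20")
for patient_id, dx_code, doc_id, paragraph in results:
    print(patient_id, dx_code, f"(source document {doc_id}): {paragraph}")
```

Keeping a source_doc column next to each extracted value is one simple way to let a classical SQL query return the reference alongside the data, so a reader can check the extraction against the original paragraph.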
