Exploring the use of natural language systems for fact identification: Towards the automatic construction of healthcare portals

In prior work we observed that expert searchers follow well-defined search procedures to obtain comprehensive information on the Web. Motivated by that observation, we developed a prototype domain portal, the Strategy Hub, that provides expert search procedures to novice searchers. Because the search procedures in the prototype were handcrafted entirely by search experts, further expansion of the Strategy Hub was cost-prohibitive. However, a recent study on the distribution of healthcare information on the Web suggested that search procedures can be generated automatically from pages that have been rated for the extent to which they cover facts relevant to a topic. This paper presents the results of experiments designed to automate that rating process. To generate the ratings automatically, we used two natural language systems, Latent Semantic Analysis (LSA) and MEAD, to compute the similarity between each sentence on a page and each fact. We then used an algorithm to convert these similarity scores into a single rating representing the extent to which the page covered each fact. Finally, we compared the automatic ratings with manual ratings using inter-rater reliability statistics. Analysis of these statistics reveals the strengths and weaknesses of each tool and suggests avenues for improvement.
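To make the pipeline concrete, below is a minimal Python sketch of the rating step described above. It is illustrative only: the function name `rate_fact_coverage`, the use of scikit-learn's TruncatedSVD as a stand-in LSA implementation, the binary covered/not-covered rating, the 0.6 similarity threshold, and the toy melanoma sentences are all assumptions, not the paper's actual algorithm, corpus, or parameters, and the MEAD-based variant is not shown.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import cohen_kappa_score


def rate_fact_coverage(page_sentences, fact, n_components=2, threshold=0.6):
    """Rate the extent to which a page covers a fact: compute the LSA
    similarity of each page sentence to the fact, then collapse the
    per-sentence scores into a single binary rating (hypothetical scheme)."""
    docs = page_sentences + [fact]
    tfidf = TfidfVectorizer().fit_transform(docs)  # term-document matrix
    # Project into a low-dimensional latent semantic space (LSA).
    latent = TruncatedSVD(n_components=n_components).fit_transform(tfidf)
    # Cosine similarity between each sentence and the fact in that space.
    sims = cosine_similarity(latent[:-1], latent[-1:]).ravel()
    return int(sims.max() >= threshold)


page = [
    "Surgery is the standard treatment for early-stage melanoma.",
    "Daily sunscreen use reduces the risk of skin cancer.",
]
fact = "Melanoma is usually treated by surgical excision."
automatic_rating = rate_fact_coverage(page, fact)

# Agreement between automatic and manual ratings over a set of pages can
# then be summarized with an inter-rater reliability statistic such as
# Cohen's kappa (ratings below are hypothetical).
manual = [1, 0, 1, 1]
automatic = [1, 0, 0, 1]
kappa = cohen_kappa_score(manual, automatic)
```

In practice an LSA space would be trained on a background corpus far larger than the page itself, and the rating algorithm may produce graded rather than binary scores; the tiny corpus and threshold here exist only to make the sketch self-contained and runnable.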
