Quality issues in thesaurus building: a case study from the medical domain

Ensuring the quality of a medical thesaurus is a non-trivial task because of the inherent complexity of medical terminology. The peculiarities of the medical sublanguage and the subjectivity of lexicographers' choices complicate the thesaurus construction process. Our experience is based on the MorphoSaurus lexicon, the basis of a biomedical cross-language indexing and retrieval system. We describe two complementary maintenance approaches: i) corpus-based error detection and ii) thesaurus anomaly detection. These techniques were developed to detect so-called dynamic and static errors, which lexicographers introduce during the construction and maintenance process. In multilingual parallel corpora, the distribution of semantic identifiers should be similar across related texts in different languages. The first approach therefore identifies the semantic identifiers that exhibit the greatest frequency variations between the texts of a pair. A manual review of these search results is intended to spot content errors, which the lexicographers subsequently classify and fix. The second approach analyses transaction-based anomalies, which are identified by interpreting the log of lexicographers' actions during thesaurus maintenance. This methodology highlights the four most common types of such anomalies and evaluates the effectiveness of the corpus-based detection techniques. The overall quality improvement of the thesaurus was evaluated using the OHSUMED IR benchmark.
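
The idea behind the corpus-based approach can be illustrated with a minimal sketch. It assumes that each text of a parallel pair has already been mapped to a list of semantic identifiers, and it uses the absolute difference of relative frequencies as the deviation score. The function names, the score, and the sample identifiers are hypothetical choices for illustration and are not taken from the MorphoSaurus system.

```python
from collections import Counter


def relative_frequencies(identifiers):
    """Map each semantic identifier to its share of all identifiers in a text."""
    counts = Counter(identifiers)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {mid: n / total for mid, n in counts.items()}


def rank_frequency_deviations(ids_lang_a, ids_lang_b, top_k=5):
    """Rank identifiers by how much their relative frequency differs
    between the two texts of a parallel pair (largest deviation first).

    Assumption: absolute difference of relative frequencies is used as
    the score; other measures could be substituted.
    """
    freq_a = relative_frequencies(ids_lang_a)
    freq_b = relative_frequencies(ids_lang_b)
    deviations = {
        mid: abs(freq_a.get(mid, 0.0) - freq_b.get(mid, 0.0))
        for mid in set(freq_a) | set(freq_b)
    }
    # Sort by descending deviation, then by identifier for a stable order.
    ranked = sorted(deviations.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_k]


# Hypothetical identifier sequences for an English/German text pair.
english_ids = ["#heart", "#inflamm", "#heart", "#muscle"]
german_ids = ["#heart", "#inflamm", "#heart", "#stomach"]

suspects = rank_frequency_deviations(english_ids, german_ids)
for identifier, deviation in suspects:
    print(f"{identifier}\t{deviation:.2f}")
```

In this toy pair, the identifiers that occur in only one of the two texts rise to the top of the list; these are the candidates a lexicographer would review manually.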
