The development of a MeSH-based biomedical termbase at Hogeschool Gent

This paper reports on an ongoing long-term project to build an English-and-Dutch termbase using the MeSH terms (Medical Subject Headings) as input. Although NLP applications were envisaged from the start, the database has mainly been built according to the traditional principles of terminology management for human translation. With important parts of the project now nearing completion, the question arises whether and how the material could be made available in a traditional dictionary format as well as in a format that can be used in language technology applications. It is argued that the detailed traditional working method, which is based on explicit evidence and records a wealth of information on synonyms, variants, usage and reliability, can also benefit NLP applications. It is unlikely, however, that a single format can be found to make the data available for all possible purposes. Rather, the current database will have to act as a common repository from which various extractions can be made, through conversion, for different applications. To facilitate conversions, it would be expedient for future projects to work towards a uniform standard from the start. It is speculated that TermBase eXchange may be the most promising emerging standard at the moment.

1. Existing Medical Glossaries with Dutch

Existing English-and-Dutch medical dictionaries are limited in scope, especially when compared with the vast wealth of medical terms found in thesauri like the Medical Subject Headings (MeSH, http://www.nlm.nih.gov/mesh/). Among the bilingual sources we may mention two dictionaries in paper form, Kerkhof (2003) and Mostert (2002), both of them slim volumes covering both language directions. Online lists like Taalvlinder (http://www.ochrid.dds.nl/medici.htm) and Woordenboek Ziekenhuistermen (via http://www.ziekenhuis.nl) are very deserving but also limited in their number of entries as well as in the information provided.
An important multilingual list in which Dutch is also represented is the Multilingual glossary of technical and popular medical terms in nine European languages at http://allserv.rug.ac.be/~rvdstich/eugloss/welcome.html, developed at our college in co-operation with the Heymans Instituut voor Farmacologie. Yet, here too, a term like orthopaedic will obviously be found but a more technical item like orthomolecular will be absent.

2. A Bilingual Termbase Project

An obvious and undoubtedly rewarding way to increase the scope of a medical glossary is to take input from a detailed medical thesaurus like the MeSH. This idea was suggested to us by R. Vander Stichele of the Heymans Instituut in 1987. His first suggestion was to provide Dutch equivalents for the MeSH subject headings so that, for example, the Dutch headings could be used to search the Index Medicus, or so that the Dutch as well as English headings could be used for indexing medical publications co-sponsored by his Institute. (On the topic of Cross-Language Information Retrieval see also Peter Schäuble et al. and references there.) By suggesting the idea to our School of Translation Studies (Hogeschool Gent), however, he had awakened another interest, viz. the development of a full-scale medical dictionary. This was to take the project beyond applications such as indexing, document retrieval and natural language processing (NLP) in general, and to make it useful also for human translators dealing with a variety of medical texts. As will be indicated below, the interests of NLP specialists and traditional translators/terminologists do not always coincide, but bringing the two parties together can be beneficial.

Lack of adequate funding for the project meant that it was split up into a large number of thesis subjects (over 130 to date). Students are each assigned a subchapter from the MeSH, so that they can concentrate on a specialist subject area.
They liaise with one or more specialists in that subject area, preferably staff in the University Hospital, and they fill in (very) detailed records on each concept studied. Research involves primary texts as well as reference works and informants. Work has been slow but thorough. The MeSH chapter on diseases is now 90% covered and the chapter on medical procedures and techniques is also nearing completion. Large sections of other chapters have also been dealt with but some need revision. In the last couple of years, work has started on adding French equivalents using the same detailed record, but here too progress is slow.

There are now plans to publish specific parts of the Dutch-and-English material, possibly on CD-ROM or on a protected website, and the project leaders are faced with a choice between a more traditional dictionary format that would undoubtedly be welcomed by human medical translators, a machine readable format that would be welcomed by language technologists, or both. There can be no doubt that the way in which the material has been developed is slanted towards the traditional dictionary approach; yet it is believed to be sufficiently structured to allow conversion to an NLP-type glossary.

3. NLP versus traditional terminology

As suggested earlier, cross-fertilization between terminology work for NLP on the one hand and traditional terminography on the other stands to benefit both parties. NLP practitioners are typically interested in one-to-one term lists in machine readable form, whereas traditional terminologists tend to favour detailed records for each concept. One-to-one conversions of the MeSH thesaurus have been created for several languages (cf. http://www.nlm.nih.gov/research/umls/sources_by_categories.html).
Some can be consulted via HONselect (http://debussy.hon.ch/cgi-bin/HONselect?search). A (partial) Dutch version commissioned by the Nederlands Tijdschrift voor Geneeskunde, codenamed MSHDUT2004, is obtainable for research purposes (though not for commercial purposes). Yet traditional terminologists have been quick to point out errors in the existing translations and have claimed that they are "rough and ready" conversions only. While this claim is awaiting substantiation (i.e. via a detailed review), it is true that the translation of extensive lists like the MeSH headings, spanning several specialisms, is a very time-consuming task (if it is to be done well), so that the fast creation of equivalent lists is at least grounds for suspicion. There are also other aspects that traditionalists are likely to frown upon, as well as aspects that they tend to ignore and that NLP supporters are much better at. Examples of either category are explored and illustrated below.

In Pierre Zweigenbaum, Stefan Schulz, and Patrick Ruch, editors, LREC 2006 Workshop on Acquiring and Representing Multilingual, Specialized Lexicons: the Case of Biomedicine. Genova, Italy, 2006. ELDA.

3.1. The issue of evidence

The creation of one-to-one lists relieves the makers of the arduous task of giving evidence. Traditional terminologists like to quote their sources in evidence: the term is given in one or more original fragments of text ("contexts"), with a detailed reference to the source. Sometimes the reference is to an informant. These details are often absent from a machine readable glossary. While this is understandable, it should be a matter of principle that even when machine readable lists do not give quoted examples or other evidence, the lists should somehow be backed up by a database that does give these data.

3.2. The issue of synonyms

Machine readable glossaries tend to rest on the fiction that technical vocabularies have one term for one concept.
While this is the ideal situation in a normative approach (and was also the situation envisaged by the founding father of terminology, Eugen Wüster), it definitely does not hold true of medical terminology. Monolingual medical dictionaries of English illustrate that the same concept is often referred to by a whole series of synonyms. The treatment of a patient with drugs, for example, can alternatively be termed drug treatment, pharmacotherapy, pharmacologic therapy, pharmacological treatment or medication therapy. The International dictionary of medicine and biology (Landau et al., 1986), in particular, has a habit of quoting many alternatives. While some of these may be related terms rather than true synonyms (and while it is wise also in other respects to make a distinction between "true synonyms" and "near synonyms" / "extra synonyms", cf. 3.4 below), it remains undeniably true that the use of alternative names is common in medicine.

Where terminologies are used for indexing, there is a feeling that synonyms should be disregarded and that preference must be given to a favoured term (the normative approach). The human translator, however, knows that each of the alternative terms may crop up in a text and is therefore interested in having them all recorded in the termbase. Yet even for NLP purposes, it is useful not only to establish reference terms but also to link them up with synonyms (or even cognate words). This is already done in document retrieval. Here too, the detailed groundwork that traditional terminologists are apt to do can also be relevant for the machine readable derivations.

3.3. The issue of usage

Dutch medical language, more so than English, has variants that can be termed either "technical" or "popular". The former terms (nausea) would be favoured in the scholarly literature; the latter (misselijkheid) would be used in communication with patients and are therefore also eligible for use in patient information leaflets.
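
The kind of concept-oriented record discussed in sections 3.1 to 3.3 can be sketched in code. The following Python fragment is only an illustration under stated assumptions, not the project's actual database layout: the class names, field names, concept identifiers and the choice of which term is marked as preferred are all hypothetical. It shows how a single detailed record per concept (terms with language, register, and optional source and context) could serve as a repository from which both a one-to-one list and a synonym index for retrieval are derived.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Term:
    text: str
    lang: str                       # "en" or "nl"
    register: str = "technical"     # "technical" or "popular"
    preferred: bool = False         # reference term for this language (assumed flag)
    source: Optional[str] = None    # where the term was attested (evidence)
    context: Optional[str] = None   # quoted fragment showing the term in use


@dataclass
class ConceptRecord:
    concept_id: str                 # hypothetical ID, not a real MeSH identifier
    terms: List[Term] = field(default_factory=list)

    def preferred_term(self, lang: str) -> Optional[str]:
        """Return the reference term for a language, or None if there is none."""
        for term in self.terms:
            if term.lang == lang and term.preferred:
                return term.text
        return None

    def variants(self, lang: str, register: Optional[str] = None) -> List[str]:
        """Return all terms for a language, optionally filtered by register."""
        return [t.text for t in self.terms
                if t.lang == lang and (register is None or t.register == register)]


def one_to_one(records: List[ConceptRecord], src: str, tgt: str) -> List[Tuple[str, str]]:
    """Extract a flat list pairing the preferred terms of two languages."""
    pairs = []
    for record in records:
        s, t = record.preferred_term(src), record.preferred_term(tgt)
        if s is not None and t is not None:
            pairs.append((s, t))
    return pairs


def synonym_index(records: List[ConceptRecord], lang: str) -> Dict[str, str]:
    """Map every recorded variant (lower-cased) to its reference term."""
    index = {}
    for record in records:
        reference = record.preferred_term(lang)
        if reference is None:
            continue
        for variant in record.variants(lang):
            index[variant.lower()] = reference
    return index


# Sample data uses only terms quoted in the text; the preferred flags are
# illustrative, and source/context are left empty rather than invented.
records = [
    ConceptRecord("C-0001", [
        Term("nausea", "en", preferred=True),
        Term("nausea", "nl", preferred=True),
        Term("misselijkheid", "nl", register="popular"),
    ]),
    ConceptRecord("C-0002", [
        Term("drug treatment", "en", preferred=True),
        Term("pharmacotherapy", "en"),
        Term("pharmacologic therapy", "en"),
        Term("pharmacological treatment", "en"),
        Term("medication therapy", "en"),
    ]),
]

print(one_to_one(records, "en", "nl"))
print(synonym_index(records, "en")["pharmacotherapy"])
print(records[0].variants("nl", register="popular"))
```

In this sketch the flat list and the synonym index are both derived views: the evidence fields stay in the underlying record even when an extraction omits them, which is the relationship between machine readable lists and a backing database argued for in 3.1.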
In fact, the need for popular equivalents that could be used to make information leaflets more readable prompted the European Commission to sponsor the Multilingual glossary referred to above. (In the US, patient information does not enjoy the same status, mainly because of legal concerns; cf. Vander Stichele, 2004, 13ff.). Again, a translator would want to