Multilingual Thesauri in Cross-Language Text and Speech Retrieval

This paper sets forth a framework for the use of thesauri as knowledge bases in cross-language retrieval. It provides a general introduction to thesaurus functions, structure, and construction, with particular attention to the problems of multilingual thesauri.

1 Thesaurus functions

A thesaurus is a structure that manages the complexities of terminology in language and provides conceptual relationships. A thesaurus for assisting writers suggests from a semantic field the term that best conveys the intended meaning and connotation. In information retrieval a thesaurus or classification can be used in two ways: (1) controlled vocabulary indexing and searching (applicable to any kind of retrieval); (2) knowledge-based support of free-text searching (applicable only to written or spoken text, although the text could point to another object, e.g. retrieving images through a free-text search of written or spoken image captions or through a search of the text portion of a movie).

To look at thesaurus functions more generally, we first observe that a thesaurus is a knowledge base on concepts and terminology; other such knowledge bases are dictionaries and ontologies developed for AI applications, linguistic systems, or data element definition. Since these different types of knowledge bases, though developed for different purposes, overlap greatly, it would be best to integrate them through a common access system (Soergel 1996). The functions to be served by such a virtual integrated knowledge base of concepts and terminology are listed in Fig. 1. One of these functions, user-centered or request-oriented indexing, is central in information retrieval and deserves further explanation.
It involves constructing a thesaurus based on the conceptual analysis of actual user queries and interests, and constructing a framework that includes the concepts of interest to the users and thus communicates these interests to the indexers or a sophisticated machine indexing system. The human indexers can then become the "eyes and ears" of the users and index materials from the users’ perspective. The thesaurus includes the concepts for which the users want to search; the indexers use this structured list of concepts as a checklist, applying their understanding of a document (or other object) to judge the relevance of a document to any of these concepts, thus making sure that a user who, in a search for a given concept, would find a given document relevant will indeed find that document.

· Provide a semantic road map to individual fields and the relationships among fields; relate concepts to terms, and provide definitions, thus providing orientation and serving as a reference tool.
· Improve communication and learning generally: assist writers and readers, support learning through providing conceptual frameworks, support language learning and the development of instructional materials.
· Provide the conceptual basis for the design of good research and implementation. Assist researchers and practitioners in exploring the conceptual context of a research project, policy, plan, or implementation project and in structuring the problem. Consistent definition of variables and measures for more comparable and cumulative research and evaluation results.
· Provide classification for action: a classification of diseases for diagnosis, of medical procedures for insurance billing, of commodities for customs.
· Support information retrieval, including knowledge-based support of end-user searching (menu trees, guided facet analysis of a search topic, browsing a hierarchy to identify search concepts, mapping from the user’s query terms to descriptors used in one or more databases or to the multiple natural language expressions to be used for free-text searching), hierarchically expanded searching, support of well-structured displays of search results, providing a tool for indexing (vocabulary control, user-centered indexing).
· Conceptual basis for knowledge-based systems.
· Do all this across multiple languages.
· Mono-, bi-, or multilingual dictionary for human use.
· Dictionary/knowledge base for automated language processing: machine translation and natural language understanding (data extraction, automatic abstracting/indexing).

Fig. 1. Functions of a knowledge base of concepts and terminology

A document can be relevant for a concept without being about the concept. For example, a document titled "The percentage of children of blue-collar workers going to college" is not necessarily about intergenerational social mobility, but a researcher interested in that topic would surely like to find it, so it is relevant. Another example: users are interested in biochemical basis of behavior and also in quickly retrieving all longitudinal studies, so these descriptors are in the thesaurus. The indexer examines the document "CSF studies on alcoholism and related behaviors" and finds that it is relevant to both descriptors. ("Longitudinal" is not mentioned in the document, and it took careful examination of the methods section to make that determination.) An expert system for indexing could, to a degree, draw the inferences needed for user-centered indexing.

Is that kind of indexing expensive? Yes, unless it can (to a degree) be automated through a knowledge-based system for automated indexing. Is it worthwhile? Few empirical studies evaluating user-centered (as opposed to the commonly used document-centered) indexing exist; they show a positive effect on retrieval performance.
The worth derived from improved performance depends on the use of the retrieval results.

As we shall see, this perspective on indexing has implications for cross-language retrieval: The conceptual framework must be communicated in every participating language to allow a meeting of minds to take place regardless of the languages of the user and the indexer.

Note on terminology: (1) Cross-language retrieval is the retrieval of any type of object composed or indexed in one language with a query formulated in another language. A cross-language retrieval system may have any number of source languages in which queries can be formulated (from 1 to many) and any number of target languages in which objects can be composed or indexed. (2) While "text retrieval" has come to mean retrieval of written text, and "speech retrieval" retrieval of spoken text, in linguistics text can be written or spoken, and we will use "text" in this broad meaning. (3) A thesaurus that exists in more than one language is called a multilingual thesaurus.

2 Thesaurus structure

Knowledge of thesaurus structure is a prerequisite for understanding the implementation of thesaurus functions in retrieval. Thus we deal with it first. Section 2.1 reviews principles applicable to any thesaurus and Section 2.2 issues specific to multilingual thesauri.

2.1 Brief review of thesaurus structure principles

2.1.1 Terminological structure

The terminological structure establishes synonym relationships between terms and disambiguates homonyms, as illustrated in the following examples. Terms from another language with the same meaning are synonyms in a broad sense.
Controlling synonyms

Term                       Preferred synonym
Alcoholism                 Alcohol dependence
Inheritance                Heredity
Ultrasonic cardiography    Echocardiography
Black                      African American
Afro-American              African American
Pregnant adolescent        Pregnant teen

Disambiguating homonyms

Discharge 1 (From hospital or program)
Discharge 2 (From organization or employment)    Preferred synonym: Dismissal
Discharge 3 (Medical symptom)
Discharge 4 (Electrical)

The terminological structure is equally important in controlled vocabulary and in free-text searching. In free-text searching, synonym relationships are used "in reverse" for synonym expansion of query terms, and homonym indicators are used to initiate a question to the user on the intended meaning of the term in the query.

2.1.2 Conceptual structure

A well-developed conceptual structure is a sine qua non for user-centered indexing. But I posit that beyond that it can be extremely useful in any kind of retrieval system, including free-text retrieval. Examples will illustrate this point.

Semantic factoring or feature analysis. Facets

Semantic factoring means analyzing a concept into its defining components or elemental concepts; linguists speak of feature analysis.
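The free-text use of the terminological structure described in Section 2.1.1 (synonym relationships applied "in reverse" for query expansion, and homonym indicators triggering a question to the user) can be sketched in a few lines of Python. This is only an illustrative sketch: the entries are copied from the examples in Section 2.1.1, while the dictionary layout and the names `PREFERRED`, `HOMONYMS`, `synonym_set`, and `expand_query` are assumptions introduced here, not part of any described system. The preferred synonym Dismissal for Discharge 2 is left out for brevity.

```python
# Illustrative sketch only. Entries come from the examples in Section 2.1.1;
# the data layout and function names are assumptions made for this example.

# Non-preferred term -> preferred synonym (lower-cased for matching).
PREFERRED = {
    "alcoholism": "alcohol dependence",
    "inheritance": "heredity",
    "ultrasonic cardiography": "echocardiography",
    "black": "african american",
    "afro-american": "african american",
    "pregnant adolescent": "pregnant teen",
}

# Homonym -> the senses a system would ask the user to choose among.
HOMONYMS = {
    "discharge": [
        "from hospital or program",
        "from organization or employment",
        "medical symptom",
        "electrical",
    ],
}


def synonym_set(term):
    """Return the term together with every term sharing its preferred synonym."""
    key = term.lower()
    # A term that is not listed as non-preferred is treated as its own
    # preferred form.
    preferred = PREFERRED.get(key, key)
    # Using the synonym table "in reverse": collect all non-preferred
    # variants that point to the same preferred term.
    variants = {t for t, p in PREFERRED.items() if p == preferred}
    return sorted(variants | {preferred, key})


def expand_query(term):
    """Expand a free-text query term, or flag a homonym for clarification."""
    key = term.lower()
    if key in HOMONYMS:
        # Ambiguous term: do not expand; return the senses to ask about.
        return {"ask_user": HOMONYMS[key], "terms": []}
    return {"ask_user": [], "terms": synonym_set(key)}
```

Under these assumptions, `expand_query("Afro-American")` yields the three variants that share the preferred term African American, whereas `expand_query("Discharge")` returns no search terms and instead lists the four senses the user would be asked to choose from.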