Multilingual database and crosslingual interrogation in a real internet application: Architecture and problems of implementation
The EMIR European project demonstrated the feasibility of crosslingual interrogation of full-text databases using a monolingual and a bilingual general reformulation of the query. New developments have been made to handle multilingual databases, even when a single document contains more than one language. Problems of implementing full-scale applications are discussed.

Background

Our crosslingual text retrieval technology is based on the results of the EMIR (European Multilingual Information Retrieval) project (Fluhr 1994), carried out within the framework of the ESPRIT European Program. The project covered three languages, English, French and German, with partners from France, Germany and Belgium. It ran from October 1990 to April 1994.

The project's technological support was an existing multi-monolingual (French and English) text retrieval software: SPIRIT (Syntactic and Probabilistic Indexing and Retrieval of Information in Texts). SPIRIT has been marketed since 1980 on IBM mainframes, since 1985 on various platforms (from PCs to mainframes), and since 1993 in a client-server architecture.

At this time, part of the EMIR results has been introduced into the SPIRIT system. Applications mainly concern the French-English pair. A new language, Russian, has been added, and Dutch is on the way.

Main principles of the approach of crosslingual interrogation

EMIR's approach is based on the use of a general transfer dictionary as a set of reformulation rules. That means that all possible translations are proposed as possible keys for retrieving the documents. This approach is the opposite of one consisting of a translation of the query followed by a monolingual interrogation. There are three differences:
• the reformulation is based on the translation of the concepts in the query and does not need to build a syntactically and semantically correct translated query.
• the translation of concepts is used to obtain documents: if the word inferred in the target language is not the right translation but a relevant document is obtained, we can consider that the system works properly (for example, inference of a hypernym or of a word with a different part of speech).
• the translation and retrieval processes are interleaved, so that the full-text database is used as semantic knowledge for solving translation ambiguities, which are the main problem to solve in this approach of domain-independent reformulation.

One of the main results of this approach is that if an answer to the query exists in the database, the system can, in most cases, select the right translation by looking at the most relevant documents.

Principle of the method

We suppose that the database is monolingual; we will discuss the problems specific to multilingual databases in the following paragraphs.

Database processing

The database is processed by the morphosyntactic parser. The results are normalized single words and compounds. Normalization is mainly based on lemmatization, but general synonymies can be taken into account. For example, « logiciel » and « software » in French are normalized to « logiciel », and « harbour » and « harbor » in English are normalized to « harbor ». Single words include both truly single words and idiomatic expressions such as « monkey wrench » in British English or « clé anglaise » in French. Compounds are words in dependency relations, such as « multilingual database ». For each normalized word or compound, a semantic weight is computed according to the information it contributes to choosing the relevant documents.

Query processing

The query is processed by the same morphosyntactic parser as the database. Normalized words and compounds are produced with their part of speech. For each of these units, we try to infer all possible translations that agree with the part of speech.
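The normalization step and the part-of-speech constraint on translation can be illustrated with a minimal sketch. It is not the SPIRIT or EMIR implementation: the table layout, the function names `normalize` and `translate`, and the part-of-speech tags are assumptions made for illustration, and a real system would rely on a morphosyntactic parser rather than lookup tables. The entries themselves are taken from the examples in the text.

```python
# Hypothetical normalization tables: variant form -> normalized form.
# The two pairs come from the text; everything else is left unchanged.
NORMALIZE = {
    "en": {"harbour": "harbor"},
    "fr": {"software": "logiciel"},
}

# Hypothetical transfer dictionary keyed by (source lemma, part of speech).
TRANSFER_EN_FR = {
    ("light", "ADJ"): ["léger"],
    ("light", "NOUN"): ["lumière"],
}


def normalize(word, lang):
    """Lowercase a word and map it to its normalized form if one is listed."""
    word = word.lower()
    return NORMALIZE.get(lang, {}).get(word, word)


def translate(lemma, pos, transfer):
    """Return every translation recorded for this lemma with this part of speech."""
    return list(transfer.get((lemma, pos), []))


adjective_translations = translate("light", "ADJ", TRANSFER_EN_FR)
noun_translations = translate("light", "NOUN", TRANSFER_EN_FR)
```

Keying the dictionary on the pair (lemma, part of speech) means that a translation belonging to another part of speech is never proposed, which removes one source of ambiguity before any filtering against the database takes place.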
For example, « light » as an adjective is translated by the adjective « léger » in French, but « light » as a noun is translated by the noun « lumière ». Compounds can be translated globally or word for word. In the latter case, the word order is rearranged to fit the result of the target-language normalization of compounds. Not all compounds can be translated word for word, but it is not necessary to consider those that cannot as idiomatic expressions. A compound like « seat belt » is really a belt on a seat, whereas the French « ceinture de sécurité » is a belt for security.

Generally, especially for single words, there are many translations. Example: « talon » (French) → (English) « heel », « crust », « spur », « stub », « counterfoil », « talon ».

At this level, the results of the multilingual inference are filtered by the database lexicon, and many translations that are incompatible with the domain are eliminated.

Filtering by the database lexicon is not sufficient to eliminate all wrong translations. It is therefore possible to keep only the translations contained in the most relevant documents (that is, the ones that contain the largest number of query words, especially those where the words have the same dependency relations as in the query). Before performing this optimization, it is necessary to be quite sure that the « best » documents are relevant, that is to say that they contain a sufficient number of the most heavily weighted words.

If it is decided that the most relevant documents are really relevant, feedback can be applied to the transfer process: in a second step, only words compatible with the most relevant documents are proposed. This process is very effective at increasing relevance, but it has a bad effect on recall because it can eliminate synonyms of the chosen words that occur only in less relevant documents. It is therefore useful to follow this feedback with a monolingual reformulation in the target language.
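The two filtering steps described above (filtering by the database lexicon, then feedback from the best documents) can be sketched as follows. This is a simplified illustration under stated assumptions, not the EMIR algorithm: documents are reduced to sets of normalized terms, semantic weights and dependency relations are ignored, and relevance is approximated by the number of query words that have at least one candidate translation in the document. The function names, the `min_score` threshold, the query word « chaussure » with its translations, and the three toy documents are all invented for the example; only the candidate translations of « talon » come from the text.

```python
def filter_by_lexicon(candidates, lexicon):
    """Keep only the candidate translations that occur in the database lexicon."""
    return {src: [t for t in targets if t in lexicon]
            for src, targets in candidates.items()}


def score(doc_terms, candidates):
    """Count the query words with at least one candidate translation in the document."""
    return sum(any(t in doc_terms for t in targets)
               for targets in candidates.values())


def feedback(candidates, docs, min_score):
    """Restrict the candidates to those found in the best-scoring documents.

    If the best score is below min_score, the best documents are not trusted
    and the candidates are returned unchanged.
    """
    scores = {doc_id: score(terms, candidates) for doc_id, terms in docs.items()}
    best = max(scores.values(), default=0)
    if best < min_score:
        return candidates
    best_terms = set().union(*(docs[d] for d, s in scores.items() if s == best))
    return {src: [t for t in targets if t in best_terms]
            for src, targets in candidates.items()}


# Candidate translations: "talon" is from the text, "chaussure" is invented.
CANDIDATES = {
    "talon": ["heel", "crust", "spur", "stub", "counterfoil", "talon"],
    "chaussure": ["shoe", "footwear"],
}

# Invented toy database: each document is a set of normalized terms.
DOCS = {
    "d1": {"shoe", "heel", "repair"},
    "d2": {"cheque", "stub"},
    "d3": {"bread", "crust"},
}

LEXICON = set().union(*DOCS.values())

filtered = filter_by_lexicon(CANDIDATES, LEXICON)
selected = feedback(filtered, DOCS, min_score=2)
```

In this toy case the lexicon filter removes « spur », « counterfoil », « talon » and « footwear », which never occur in the database, and the feedback step then keeps only « heel » and « shoe », the translations found together in the best document. The `min_score` guard stands in for the requirement that the best documents must be judged relevant before their content is used to prune translations.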
After this feedback, we are in the same situation as with a well-formed query written directly in the target language, or a well-translated query, which requires a monolingual reformulation to achieve good recall.

Example of translation and filtering

Query: « spectroscopie de masse par temps de vol » on a database of 655,000 titles of reports on energy