Exploring Portability of Syntactic Information from English to Basque

This paper explores a crosslingual approach to the PP attachment problem. We built a large dependency database for English from an automatic parse of the BNC and Reuters (sports and finance sections). Basque attachment decisions are made according to how frequently the translations of the Basque (verb-noun) pairs occur in the English syntactic database. The results show that this simple technique makes it possible to transfer syntactic information from a language like English in order to make PP attachment decisions in another language, in this case Basque.

1 Authors listed in alphabetical order.

Introduction & Motivation

This work is part of a broader effort within the MEANING project (Rigau et al., 2002), whose goal is to explore the possibility of porting linguistic knowledge acquired in one language to another. This portability issue could be especially relevant for minority languages with few resources, like Basque. Hence the main motivation underlying this experiment is to explore ways to overcome the limitations caused by the lack of resources. If we were able to transfer some of the linguistic knowledge available for English to other languages, we would effectively ease some of the restrictions these languages face (small corpora, lack of hand-annotated corpora, etc.).

Cross-language information transfer is not new; however, most of the work done relies on parallel corpora (Hwa et al., 2002), which are difficult to find, especially for lesser-studied languages. This is one of the reasons that led us to consider comparable corpora, which are easier to obtain.

Another noteworthy aspect is the pair of languages selected for the experiment: English and Basque. Hypothetically, these two languages are linguistically distant enough to make this work extensible to any other language pair.
The most relevant differences between the two languages can be characterized briefly as follows:

- English is a head-initial language with SVO word order, while Basque is a head-final, free word order language.
- English does not show strong morphology, while Basque does.
- English is not a pro-drop language, while Basque is a three-way pro-drop language.
- English and Basque do not belong to the same typological family.

We chose the PP attachment problem in order to explore the portability issue. This problem is especially hard for free word order languages like Basque. Our current partial parser makes attachment decisions based on certain rules and heuristics. Our experiment was devised to transfer attachment information from parsed English data and to make the attachment decisions for Basque based on this transferred information.

The basic idea behind the system presented here is that verbs show certain preferences for the nouns they appear with. Therefore, if we have a sentence with two verbs and some noun phrases, one of the verbs will show a higher preference for some of the noun phrases, while the other verb will show a higher preference for the others. We make one assumption beyond this basic idea: that these preferences hold across languages and can, to some extent, be transferred cross-linguistically (Agirre et al., 2003). Note that this is preliminary work, so at this point we aim to keep the system as simple as possible. Thus, a higher co-occurrence of a verb and a noun is taken to indicate a higher preference of that verb for that noun.

The results obtained suggest that cross-language transfer of knowledge acquired from comparable corpora is worth pursuing. Even with very simple machinery, the results seem very promising.

Outline of the method

Our starting point was the Basque parser described in (Aldezabal et al., 2000). This parser uses a unification grammar to build syntactic structures.
Given a sentence, it chunks it into phrases, finds the head of each phrase and then, applying certain rules and heuristics, tries to link those heads to the different verbs in the sentence.

To test our attachment system, we selected sentences with two verbs and used the Basque parser to obtain information about the chunks in the sentences. The attachment information provided by the parser is discarded; only the chunking information is kept. The heads of the noun groups are extracted, and the set of all possible syntactically dependent (verb-noun) pairs is constructed. The goal is to select, for each noun, which of the two verbs it should be attached to.

The method works as follows. We first obtain the verbs and surrounding heads from the Basque sentence. We translate them into English using a bilingual dictionary, and for each Basque (verb-noun) pair we search for all possible translation combinations in the dependency database built from an automatically parsed English corpus. Take for example this Basque sentence:

Lendakariak hautezkundeak irabazi zituen botoen %60 lortuz inbertsoreen artean.
The president won the election obtaining 60% of the votes among the investors.

The verbs and heads obtained by the Basque parser/chunker are the following:

NP-ergative(lendakaria)
NP-absolutive(hautezkunde)
PP-absolutive(boto)
PP-distributive(inbertsore)
V1(irabazi)
V2(lortu)

We translate all the nouns and verbs:

NP-ergative(lendakaria): president, chairman (ncsubj)
NP-absolutive(hautezkunde): poll, election
PP-absolutive(boto): vote, vow
PP-distributive(inbertsore): investor, shareholder
V1(irabazi): to win, to earn, to gain
V2(lortu): to get, to obtain, to attain

All possible English noun-verb pairs are created, with the English relation or preposition corresponding to each Basque case. For example, for lendakari-irabazi vs. lendakari-lortu:

lendakari-irabazi: win-President-ncsubj, earn-President-ncsubj, gain-President-ncsubj
lendakari-lortu: get-President-ncsubj, obtain-President-ncsubj, attain-President-ncsubj

Note that we only search for English verb and noun translations that occur in a direct syntactic dependency (moreover, we search for an English syntactic dependency equivalent to the Basque one). We collect and add up the frequencies of all translated English pairs for each Basque (verb-noun) pair.

In order to select the correct attachment for each noun, the mutual information of the two (verb-noun) pairs is compared. This way we normalize over the number of translations, and also over the occurrences of the English translations in the target corpus.

\[
\mathrm{MI}(BV, BN) = \log \frac{P(\text{any-EVT}, \text{any-ENT})}{P(\text{any-EVT}) \, P(\text{any-ENT})}
\]

P(any-EVT, any-ENT) is the probability of finding any translation of the Basque verb together with any translation of the Basque noun in the English corpus. P(any-EVT) is the probability of finding any translation of the Basque verb in the English corpus, and P(any-ENT) is the probability of finding any translation of the Basque noun in the English corpus. A higher mutual information value (maintaining the same syntactic relation in both languages) is taken as an indication of a stronger preference between the head and one of the verbs, and that verb is the one selected.

As mentioned above, we intended to keep the same syntactic relation across both languages when searching. To do so, we used the Basque morphological case attached to each noun as an indicator of this relation. There is an equivalence between Basque morphological cases and English prepositions. This equivalence is not one-to-one: each Basque case has several English prepositions as possible translations, and vice versa. Bilingual dictionaries do not contain such information, so we used the equivalence table described in (Lersundi et al., 2002).
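The selection step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the system's actual implementation: the function names, the dictionary layout of the dependency database, and all counts are invented for the example, and pairs never seen in the English data are given a score of negative infinity, a choice the text does not specify.

```python
import math
from itertools import product

# Toy counts standing in for the English dependency database.
# All numbers are invented for illustration only.
PAIR_COUNTS = {
    ("win", "president", "ncsubj"): 8,
    ("get", "president", "ncsubj"): 2,
    ("obtain", "president", "ncsubj"): 1,
}
VERB_COUNTS = {"win": 100, "earn": 50, "gain": 50,
               "get": 400, "obtain": 60, "attain": 40}
NOUN_COUNTS = {"president": 40, "chairman": 20}
TOTAL = 10_000  # hypothetical number of dependencies in the database


def mutual_information(verb_translations, noun_translations, relations):
    """MI between any English translation of a Basque verb (any-EVT) and
    any English translation of a Basque noun (any-ENT)."""
    # Add up the frequencies of every translated pair, restricted to the
    # English relations or prepositions equivalent to the Basque case.
    joint = sum(
        PAIR_COUNTS.get((verb, noun, rel), 0)
        for verb, noun, rel in product(verb_translations,
                                       noun_translations, relations)
    )
    verb_total = sum(VERB_COUNTS.get(v, 0) for v in verb_translations)
    noun_total = sum(NOUN_COUNTS.get(n, 0) for n in noun_translations)
    if joint == 0 or verb_total == 0 or noun_total == 0:
        # Assumption: unseen pairs get the lowest possible score.
        return float("-inf")
    p_joint = joint / TOTAL
    return math.log(p_joint / ((verb_total / TOTAL) * (noun_total / TOTAL)))


def choose_attachment(verb_candidates, noun_translations, relations):
    """Return the Basque verb with the highest MI, plus all scores."""
    scores = {
        basque_verb: mutual_information(translations, noun_translations,
                                        relations)
        for basque_verb, translations in verb_candidates.items()
    }
    return max(scores, key=scores.get), scores


# lendakari (ergative, searched as ncsubj) against the two verbs.
verbs = {"irabazi": ["win", "earn", "gain"],
         "lortu": ["get", "obtain", "attain"]}
best, scores = choose_attachment(verbs, ["president", "chairman"], ["ncsubj"])
print(best, scores)
```

With these toy counts, irabazi receives the higher score, so lendakari would be attached to it. Passing several relations at once mirrors the fact that one Basque case may correspond to several English prepositions.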
The equivalence table lists all possibilities, even low-frequency and rare ones.

The RASP parser does not incorporate exhaustive information about multiwords, so we included a heuristic method to search for them. For example, the Basque verb bilatu is translated as “look up” in English. In “look up in the dictionary”, we would like to have a dependency between “look up” and dictionary. The parser will find that dictionary is a dependent of look through the preposition in, and up will appear as a particle of look in a separate relation. The heuristic consists of searching for the pair look-dictionary related through the preposition in, and checking that up also appears as a particle of look in the same sentence.

Still, certain multiwords need more complex processing. For instance, the Basque verb garestitu is the result of an incorporation process and is translated as “to make more expensive”. At this point we do not treat such multiword translations, and they return a frequency of 0 in the search.
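The particle heuristic for translations such as “look up” can be illustrated with a minimal Python sketch. The tuple format used for dependencies here, (relation, head, dependent, marker), is a hypothetical simplification chosen for the example and is not the RASP output format; the function name and the toy sentences are likewise assumptions.

```python
def count_particle_verb_pairs(sentences, verb, particle, noun, preposition):
    """Count sentences in which `noun` depends on `verb` through
    `preposition` and `particle` is a particle of the same verb."""
    count = 0
    for dependencies in sentences:
        has_prep_link = ("prep", verb, noun, preposition) in dependencies
        has_particle = ("particle", verb, particle, None) in dependencies
        # Both relations must occur in the same sentence.
        if has_prep_link and has_particle:
            count += 1
    return count


# Each sentence is a set of hypothetical dependency tuples.
toy_sentences = [
    # "look up in the dictionary": both relations present
    {("prep", "look", "dictionary", "in"), ("particle", "look", "up", None)},
    # "look in the dictionary": no particle, so it is not counted
    {("prep", "look", "dictionary", "in")},
    # "look up at the sky": particle present, but a different noun
    {("prep", "look", "sky", "at"), ("particle", "look", "up", None)},
]

frequency = count_particle_verb_pairs(
    toy_sentences, "look", "up", "dictionary", "in")
print(frequency)
```

Only the first toy sentence satisfies both conditions. The sketch counts matching sentences rather than individual occurrences, and it checks only that both relations share the same verb lemma within a sentence, which is a simplification.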