Adapting Entities across Languages and Cultures

How would you explain Bill Gates to a German? He is associated with founding a company in the United States, so perhaps the German founder Carl Benz could stand in for Gates in those contexts. This type of translation is called adaptation in the translation community (Vinay and Darbelnet, 1995). Until now, this task has not been done computationally. Automatic adaptation could be used in natural language processing for machine translation and, indirectly, for generating new question answering datasets and for education. We propose two automatic methods and compare them to human results for this novel NLP task. First, a structured knowledge base adapts named entities using their shared properties. Second, vector arithmetic and orthogonal embedding mappings identify better candidates, but at the expense of interpretable features. We evaluate our methods through a new dataset¹ of human adaptations.

1 When Translation Misses the Mark

Imagine reading a translation from German, “I saw Merkel eating a Berliner from Dietsch on the ICE”. This sentence is opaque without cultural context. An extreme cultural adaptation for an American audience could render the sentence as “I saw Biden eating a Boston Cream from Dunkin’ Donuts on the Acela”, elucidating that Merkel holds a political post similar to Biden’s; that Dietsch (like Dunkin’ Donuts) is a mid-range purveyor of baked goods; that both Berliners and Boston Creams are filled, sweet pastries named after a city; and that the ICE and the Acela are slightly ritzier high-speed trains. Human translators make this adaptation when it is appropriate to the translation (Gengshen, 2003).

¹ Available at https://go.umd.edu/adaptation

Top Adaptations for Bill Gates:

WikiData         3CosAdd    Human
F. Zeppelin      congstar   A. Bechtolsheim
Günther Jauch    Alnatura   Dietmar Hopp
N. Harnoncourt   GMX        Carl Benz

Table 1: WikiData and unsupervised embeddings (3CosAdd) generate adaptations of an entity, such as Bill Gates. Human adaptations are gathered for evaluation.
American and German entities are color coded.

Because adaptation is understudied, we leave the full translation task to future work. Instead, we focus on the task of cultural adaptation of entities: given an entity in a source culture, what is the corresponding entity in the target culture? Most Americans would not recognize Christian Drosten, but the most efficient explanation to an American would be to say that he is the “German Anthony Fauci” (Loh, 2020). We provide top adaptations suggested by algorithms and humans for another American involved with the pandemic response, Bill Gates, in Table 1. Can machines reliably find these analogs with minimal supervision? We generate these adaptations with structured knowledge bases (Section 3) and word embeddings (Section 4). We elicit human adaptations (Section 5) to evaluate whether our automatic adaptations are plausible (Section 5.3).

2 Wer ist Bill Gates? (Who is Bill Gates?)

We define cultural adaptation and motivate its application for tasks like creating culturally-centered training data for QA. Vinay and Darbelnet (1995) define adaptation as translation in which the relationship, not the literal meaning, between the receiver and the content needs to be recreated. You could formulate our task as a traditional analogy, Drosten:Germany :: Fauci:United States (Turney, 2008; Gladkova et al., 2016), but despite this superficial resemblance (explored in Section 4), traditional approaches to analogy ignore the influence of culture and typically stay within a single language. Yet analogies are tightly bound with culture; humans struggle with analogies outside their own culture (Freedle, 2003). This task can help identify named entities (Kasai et al., 2019; Arora et al., 2019; Jain et al., 2019) and aid understanding of other cultures (Katan and Taibi, 2004).

2.1 . . . and why Bill Gates?

This task requires a list of named entities adaptable to other cultures. Our entities come from two sources: a subset of the top 500 most visited German/English Wikipedia pages and the Non-Official Characterization list (Veale, 2016, NOC), “a source of stereotypical knowledge regarding popular culture, famous people (real and fictional) and their trade-mark qualities, behaviours and settings”. Wikipedia contains a plethora of singers and actors; we filter the top 500 pages to avoid a pop-culture skew.² We additionally select all Germans and a subset of Americans from the Veale NOC list, as it is human-curated, verified, and covers a broader historical period than popular Wikipedia pages.

Like other semantic relationships (Boyd-Graber et al., 2006), adaptation is not symmetric. Thus, we adapt entities in both directions; while Berlin is the German Washington, DC, there is less consensus on what is the American Berlin, as Berlin is at once the capital, a tech hub, and a film hub. A full list of our entities is provided in Appendix D.

3 Adaptation from a Knowledge Base

We first adapt entities with a knowledge base. We use WikiData (Vrandečić and Krötzsch, 2014), a structured, human-annotated representation of Wikipedia entities that is actively developed. This resource is well-suited to the task because its features are standardized both within and across languages, and many knowledge bases explicitly encode the nationality of individuals, places, and creative works. Each entity in the knowledge base is a discrete sparse vector, where most dimensions are unknown or not applicable (e.g., a building does not have a spouse). We discuss the applicability of using Wikipedia (i.e., what proportion of the English Wikipedia is visited from the United States) in Appendix B.

For example, Angela Merkel is a human (instance of), German (country of citizenship), a politician (occupation), a Rotarian (member of), a Lutheran (religion), 1.65 meters tall (height), and has a PhD (academic degree). How would we find the “most similar” American adaptation to Angela Merkel?
Intuitively, we should find someone whose nation-
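This property-overlap intuition can be sketched as follows. The snippet below is a minimal, illustrative sketch only: the entities, the hand-picked (property, value) feature sets, and the Jaccard scoring are toy assumptions for exposition, not the paper's actual WikiData pipeline or weighting.

```python
# Toy sketch of knowledge-base adaptation: each entity is a sparse set of
# (property, value) pairs, and target-culture candidates are ranked by
# overlap with the source entity's properties.

def jaccard(a, b):
    """Overlap between two property sets."""
    return len(a & b) / len(a | b)

# Illustrative WikiData-style feature sets (hypothetical, hand-picked).
merkel = {("instance of", "human"), ("occupation", "politician"),
          ("country", "Germany"), ("position", "head of government")}

candidates = {
    "Joe Biden": {("instance of", "human"), ("occupation", "politician"),
                  ("country", "United States"),
                  ("position", "head of government")},
    "Tom Hanks": {("instance of", "human"), ("occupation", "actor"),
                  ("country", "United States")},
}

def adapt(entity, candidates):
    # Drop the nationality feature itself, so it does not penalize
    # every cross-culture candidate equally.
    strip = lambda feats: {f for f in feats if f[0] != "country"}
    scored = {name: jaccard(strip(entity), strip(feats))
              for name, feats in candidates.items()}
    return max(scored, key=scored.get)

print(adapt(merkel, candidates))  # prints "Joe Biden"
```

With these toy features, Biden shares all of Merkel's non-nationality properties (score 1.0) while Hanks shares only one of four (score 0.25), so the politician is preferred.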

References

[1] Ellen M. Voorhees, et al. The TREC-8 Question Answering Track Report, 1999, TREC.
[2] Jeffrey Dean, et al. Distributed Representations of Words and Phrases and their Compositionality, 2013, NIPS.
[3] Jordan Boyd-Graber, et al. Towards Deconfounding the Influence of Subject's Demographic Characteristics in Question Answering, 2021, ArXiv.
[4] Philipp Koehn, et al. Moses: Open Source Toolkit for Statistical Machine Translation, 2007, ACL.
[5] Omer Levy, et al. Linguistic Regularities in Sparse and Explicit Word Representations, 2014, CoNLL.
[6] Eneko Agirre, et al. A robust self-learning method for fully unsupervised cross-lingual mappings of word embeddings, 2018, ACL.
[7] Hagen Schulze. The Course of German Nationalism: From Frederick the Great to Bismarck 1763-1867, 1991.
[8] Viktor Hangya, et al. Unsupervised Parallel Sentence Extraction with Parallel Segment Detection Helps Machine Translation, 2019, ACL.
[9] Jeff Johnson, et al. Billion-Scale Similarity Search with GPUs, 2017, IEEE Transactions on Big Data.
[10] Jordan L. Boyd-Graber, et al. Adding dense, weighted connections to WordNet, 2005.
[11] Satoshi Matsuoka, et al. Analogy-based detection of morphological and semantic relations with word embeddings: what works and what doesn't, 2016, NAACL.
[12] Peter D. Turney. A Uniform Approach to Analogies, Synonyms, Antonyms, and Associations, 2008, COLING.
[13] David Katan, et al. Translating Cultures: An Introduction for Translators, Interpreters and Mediators, 2014.
[14] Sabine Schulte im Walde, et al. Improving Zero-Shot-Learning for German Particle Verbs by using Training-Space Restrictions and Local Scaling, 2016, *SEM@ACL.
[15] Shi Feng, et al. Trick Me If You Can: Human-in-the-loop Generation of Adversarial Question Answering Examples, 2019, Trans. Assoc. Comput. Linguistics.
[16] Tomas Mikolov, et al. Enriching Word Vectors with Subword Information, 2016, TACL.
[17] Guillaume Lample, et al. Word Translation Without Parallel Data, 2017, ICLR.
[18] Sebastian Riedel, et al. MLQA: Evaluating Cross-lingual Extractive Question Answering, 2019, ACL.
[19] Zachary C. Lipton, et al. Entity Projection via Machine Translation for Cross-Lingual NER, 2019, EMNLP.
[20] David Robinson, et al. Das Cabinet des Dr. Caligari, 1997.
[21] Hermann Ney, et al. Effective Cross-lingual Transfer of Neural Machine Translation Models without Shared Vocabularies, 2019, ACL.
[22] Quoc V. Le, et al. Exploiting Similarities among Languages for Machine Translation, 2013, ArXiv.
[23] Jungo Kasai, et al. Low-resource Deep Entity Resolution with Transfer and Active Learning, 2019, ACL.
[24] Bofang Li, et al. The (too Many) Problems of Analogical Reasoning with Word Vectors, 2017, *SEMEVAL.
[25] Mona T. Diab, et al. Multi-Domain Goal-Oriented Dialogues (MultiDoGO): Strategies toward Curating and Annotating Large Scale Dialogue Data, 2019, EMNLP.
[26] Mikel Artetxe, et al. On the Cross-lingual Transferability of Monolingual Representations, 2019, ACL.
[27] Geoffrey Zweig, et al. Linguistic Regularities in Continuous Space Word Representations, 2013, NAACL.
[28] T. Veale. Round Up The Usual Suspects: Knowledge-Based Metaphor Generation, 2016.
[29] Hu Gengshen. Translation as adaptation and selection, 2003.
[30] Jean-Paul Vinay, et al. Comparative stylistics of French and English: a methodology for translation, 1995.
[31] Roberto Navigli, et al. SemEval-2014 Task 3: Cross-Level Semantic Similarity, 2014, *SEMEVAL.
[32] R. Freedle. Correcting the SAT's ethnic and social-class bias: A method for reestimating SAT scores, 2003.
[33] Markus Krötzsch, et al. Wikidata, 2014, Commun. ACM.