New resources for recognition of confusable linguistic varieties: the LRE11 corpus

The NIST 2011 Language Recognition Evaluation focuses on language pair discrimination for 24 languages/dialects, some of which may be considered mutually intelligible or closely related. The LRE11 evaluation required new data for all languages, comprising both conversational telephone speech and broadcast narrowband speech from multiple sources in each language. Given the potential for confusion among varieties in the collection, manual language auditing required special care, including the assessment of inter-auditor consistency. We report on collection methods, auditing approaches, and results.

1. Data Requirements

The NIST Language Recognition Evaluation (LRE) campaigns began in 1996 with the goal of evaluating performance on language recognition in narrowband speech. The most recent campaign, LRE11, targets language pair discrimination for 24 languages/dialects, some of which may be mutually intelligible to some extent by humans [1]. (Throughout the paper we use "language" as shorthand for a linguistic variety that may be referred to by different sources as a language or dialect.) Data requirements for LRE11 demanded collection of speech sufficient to yield at least 400 narrowband segments for each language.

Traditionally, LRE evaluations have utilized large collections of conversational telephone speech (CTS). The 2009 LRE corpus represented the first departure from the standard approach in its reliance on narrowband segments embedded in broadcast programming, typically coming from listener call-ins, phone interviews of pundits, and some correspondent reports and man-on-the-street interviews. LRE11 targets collection of both CTS and broadcast narrowband speech (BNBS) for each language, with a few exceptions. Modern Standard Arabic (ara) is a formal variety that would not typically be spoken during spontaneous conversation and was excluded as a CTS collection target. Conversely, the dialectal Arabic varieties of Iraqi, Levantine and Maghrebi were not expected to appear in formal broadcast news programs and were therefore excluded as BNBS targets. Collection also targeted multiple broadcast sources, where a "source" is a provider-program pair (so Larry King Live is distinct from CNN Headline News).

To satisfy the need for data in languages that might exhibit a high degree of confusability (whether for humans or for systems), we reviewed sources including Ethnologue [2] and compiled a preliminary list of candidate languages. Each language was assigned a confusability index score:

• 0: Not likely to be confusable with another candidate language
• 1: Possibly confusable with another candidate language; the languages are related and may be confused by (some) systems if not by (most) humans
• 2: Likely confusable with another candidate language; there is evidence that (some) humans may find the varieties mutually intelligible to some extent

Table 1: Target Languages in the LRE11 Evaluation

When evaluating confusability we took care to distinguish varieties with multiple names that are generally recognized as the same language (e.g. Persian/Farsi); varieties that are mutually intelligible but given different language names for historical, social or political reasons (e.g. Hindi and Urdu); and varieties that are genuinely different, mutually unintelligible languages (e.g. Mandarin Chinese and Cantonese).
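As a rough illustration of how such a confusability index might be recorded and used to flag varieties that warrant extra auditing care, the following sketch sets out the scheme in Python. The specific pairings and scores shown are assumptions for illustration only; the actual assignments for all 24 targets are those summarized in Table 1.

```python
# Minimal sketch of the 0/1/2 confusability index described above.
# The Hindi/Urdu and Farsi/Dari pairings follow the discussion in the text;
# the remaining entries and scores are assumptions for illustration.

NOT_CONFUSABLE = 0       # not likely confusable with another candidate language
POSSIBLY_CONFUSABLE = 1  # related; may be confused by (some) systems
LIKELY_CONFUSABLE = 2    # evidence of some mutual intelligibility for humans

# variety -> (index score, candidate varieties it might be confused with)
confusability = {
    "Hindi":           (LIKELY_CONFUSABLE, ["Urdu"]),
    "Urdu":            (LIKELY_CONFUSABLE, ["Hindi"]),
    "Farsi (Persian)": (LIKELY_CONFUSABLE, ["Dari"]),
    "Dari":            (LIKELY_CONFUSABLE, ["Farsi (Persian)"]),
    "Thai":            (POSSIBLY_CONFUSABLE, ["Lao"]),  # assumed score and pairing
    "Polish":          (NOT_CONFUSABLE, []),            # assumed score
}

def needs_careful_auditing(variety):
    """True for varieties whose segments warrant extra auditing care (score >= 1)."""
    score, _ = confusability[variety]
    return score >= POSSIBLY_CONFUSABLE

for variety, (score, peers) in confusability.items():
    print(variety, score, peers, needs_careful_auditing(variety))
```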
From this exercise a set of 38 candidate languages was identified; that list was ultimately whittled down to 24 after researching the availability of broadcast sources for each language and considering the availability of employable native speakers. The final set of LRE11 languages, along with any confusable language varieties for each target, is presented in Table 1.

2. Speaker/Auditor Recruitment and Screening

Speaker and auditor recruitment for LRE11 was particularly challenging given the short timeline for data collection and the large number of languages being targeted. The collection model for the CTS component of the corpus was similar to that used in the LDC's first LRE CTS collection (CallFriend, LDC96S46-LDC96S60), but with two notable differences. The original CallFriend protocol was designed to yield exactly one call per speaker: in order to collect 200 speakers per language, we recruited 100 people and provided incentives to each one in return for making a single phone call to another speaker of their language who was in the U.S. For the LRE11 collection, we recruited fewer individuals per language and gave them incentives to call as many other speakers of their language as they could; if necessary, they could call acquaintances outside the U.S. This small core of recruited callers, or "claques", would be present in all recorded calls, so the yield of unique-speaker call sides would be lower than in CallFriend, but this would be offset by the relative efficiency of recruiting. The allowance of overseas calls raised concerns about possible correlations between particular regional telephone networks and particular languages, so we sought to enforce guidelines ensuring that each language would be represented by calls to multiple geographic regions, with a strong preference for having as many callees as possible within the U.S.

This methodology added another dimension to the standard set of recruitment challenges: recruits not only had to possess the right combination of language and professional skills, but also had to be socially well connected. Recruitment materials underscored this requirement, stating that it would be necessary to "Contact up to 30 people you know who are fluent speakers of your target language and are willing to have their voices recorded for research purposes." We targeted a minimum of 3 recruits per language; this number was established to ensure the required CTS collection volume, to permit some amount of dual auditing for the purpose of establishing inter-auditor consistency rates, and to avoid the conflict of interest that would arise from having an individual audit segments from calls in which he or she had also acted as a claque.

Given the large number of recruits targeted and the short timeline for project completion, it was critical to have an efficient and effective recruitment strategy. Recruitment was broad, targeting local and regional community organizations as well as online user communities. Initial candidate assessment was achieved by means of a multi-part online screening process. The initial screening was administered to any applicant who expressed legitimate interest in the study, and was designed to assess a candidate's availability and employability, social network density, and competence in the target language.
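The dual auditing mentioned above allows inter-auditor consistency to be quantified. The paper does not prescribe a particular measure; the sketch below shows one simple option, the raw agreement rate over dual-audited segments, using hypothetical labels.

```python
# Minimal sketch of one way to compute inter-auditor consistency on
# dual-audited segments; the simple agreement rate used here is an
# illustrative assumption, not the paper's prescribed measure.

def agreement_rate(labels_a, labels_b):
    """Fraction of dual-audited segments on which two auditors agree.

    labels_a, labels_b: parallel lists of language judgments, one entry
    per segment (e.g. "urd", "hin", or "other").
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)

# Hypothetical example: two auditors judge five segments nominally in Urdu.
auditor_1 = ["urd", "urd", "hin", "urd", "other"]
auditor_2 = ["urd", "urd", "urd", "urd", "other"]
print(f"agreement: {agreement_rate(auditor_1, auditor_2):.0%}")  # 80%
```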
The language skills portion of the screening test addressed many dimensions of competence, including how the candidate learned the language and how often the language was used for common tasks like reading the news or conversing with friends and family. Candidates who passed the initial screening were then subjected to a secondary test that required them to listen to ten segments of speech and identify those that were in their target language. This language ID test was specifically designed to include segments in languages considered mutually intelligible and/or in the same language family, and as such it emulated the actual auditing task required to support LRE11. The test also provided a valuable opportunity both to probe the conceptions of particular languages a candidate might hold and to explain how language categories were being used for the purposes of LRE11. This was especially useful in the case of a language with multiple labels: Dari, for example, is frequently called Farsi by its speakers, but for the purposes of LRE11 needed to be distinguished from the Farsi (Persian) spoken in Iran. Auditors were not expected to be experts on the various dialects within their LRE11 language category; it was accepted that auditors came with their own intuitions about specific languages that may or may not have been in line with the LRE11 categories.

Although many applicants were multilingual and were interested in making calls and auditing for more than one target language, each recruit was assigned to a single target language (the one for which they demonstrated the highest degree of nativeness). One reason for this restriction was to maximize speaker variety in the collection as a whole; another was to reduce the chances of an applicant overstating their language skills in an attempt to procure more work and therefore greater compensation. Of approximately 130 candidates who took the initial screening test, 84 were ultimately employed as claques and/or auditors.
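To make the structure of the secondary screening concrete, the following sketch scores a hypothetical ten-segment language ID test. The segment labels, the candidate's responses, and the scoring rule are illustrative assumptions rather than details taken from the actual screening instrument.

```python
# Minimal sketch of scoring the ten-segment language ID screening test.
# All data below are hypothetical; the paper does not state the exact
# scoring criterion or pass threshold.

def score_screening(answer_key, selected, target):
    """Count correct decisions over all segments.

    answer_key: dict of segment id -> true language label
    selected:   set of segment ids the candidate marked as the target language
    target:     the candidate's target language label
    """
    correct = 0
    for seg, lang in answer_key.items():
        is_target = (lang == target)
        marked = (seg in selected)
        if is_target == marked:
            correct += 1
    return correct

# Hypothetical test for a Dari auditor: distractors include Farsi segments.
key = {1: "dari", 2: "farsi", 3: "dari", 4: "pashto", 5: "farsi",
       6: "dari", 7: "urdu", 8: "dari", 9: "farsi", 10: "dari"}
candidate_choices = {1, 3, 5, 6, 8, 10}   # mistook one Farsi segment for Dari
print(score_screening(key, candidate_choices, "dari"))  # 9 of 10 correct
```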