Proceedings of the 2009 Named Entities Workshop: Shared Task on Transliteration, NEWS@IJCNLP 2009, Singapore, August 7, 2009

Named entities play a significant role in Natural Language Processing and Information Retrieval. While identifying and analyzing named entities in a given natural language is a challenging research problem in itself, the phenomenal growth of the Internet user population, especially in the non-English-speaking parts of the world, has extended this problem to the crosslingual arena. This is the specific research focus of the Named Entities Workshop (NEWS), held as part of the ACL-IJCNLP 2009 conference. The purpose of the NEWS workshop is to bring together researchers from across the world who are interested in the identification, analysis, extraction, mining and transformation of named entities in monolingual or multilingual natural language text. Within this broad scope, many specific research areas pertaining to named entities were identified, such as orthographic and phonetic characteristics, corpus analysis, unsupervised and supervised named entity extraction in monolingual or multilingual corpora, transliteration modelling, and evaluation methodologies, to name a few.
Seventeen research papers were submitted, each of which was reviewed by at least three reviewers from the program committee. Nine papers were finally chosen for publication, covering the main research areas, from named entity tagging and extraction, to computational phonology, to machine transliteration of named entities. All accepted research papers are published in the workshop proceedings.
An important part of the NEWS workshop is the shared task on machine transliteration of named entities. Machine transliteration is a vibrant research area, as witnessed by the increasing number of publications over the last decade at Computational Linguistics and Natural Language Processing conferences (ACL, EACL, NAACL, IJCNLP, COLING, HLT, EMNLP, etc.) and Information Retrieval conferences (SIGIR, ECIR, AIRS, etc.), primarily on languages that use non-Latin scripts.
However, in spite of its popularity, no meaningful comparison between the research approaches has been possible, as the publications tended to report on different language pairs and different datasets, using a variety of different metrics. For the first time, we organized a shared task as part of the NEWS workshop to provide a common evaluation platform for benchmarking and calibrating transliteration technologies. We collected large, hand-crafted parallel named entity corpora in 7 different languages from 6 language families and made them available as a common dataset for the shared task. We defined 6 metrics that are language-independent, intuitive and easy to compute. We published the details of the shared task, together with the training and development data, six months ahead of the conference, and this attracted an overwhelming response from the research community. In total, 31 teams from around the world participated, including teams from industry, government laboratories and academia. The approaches ranged from traditional unsupervised learning methods (such as naive Bayes, phrasal SMT-based methods and Conditional Random Fields) to somewhat unique approaches (such as sequence prediction models and Minimum Description Length-based methods), combined with several model combinations for re-ranking results. While every team submitted standard runs that use only the data provided by the NEWS organizers, many teams also submitted non-standard runs, in which they were allowed to use any additional data or language-specific modules. In total, about 190 task runs were submitted, covering most approaches comprehensively. A report on the shared task that summarizes all submissions, along with the original whitepaper, is also included in the proceedings and will be presented at the workshop.
The participants in the shared task were asked to submit short system papers (4 pages each) describing their approach, and each such paper was reviewed by at least two members of the program committee; 27 of them were finally accepted for publication in the workshop proceedings. To the best of our knowledge, NEWS 2009 is the first workshop that specifically and comprehensively addresses all research avenues concerned with named entities. The transliteration shared task is also the first of its kind to calibrate such a large number of systems using common metrics on common language-specific datasets across a comprehensive set of language pairs.