Proceedings of the 2010 Named Entities Workshop

Named entities play a significant role in Natural Language Processing and Information Retrieval. While identifying and analyzing named entities in a given natural language is a challenging research problem in itself, the phenomenal growth of the Internet user population, especially in the non-English-speaking parts of the world, has extended this problem to the cross-lingual arena. Our workshop series, the Named Entities WorkShop (NEWS), focuses specifically on research on all aspects of named entities. The first NEWS workshop (NEWS 2009) was held as part of the ACL-IJCNLP 2009 conference in Singapore, and the current edition (NEWS 2010) is being held as part of ACL 2010 in Uppsala, Sweden. The purpose of the NEWS workshop is to bring together researchers from across the world who are interested in the identification, analysis, extraction, mining and transformation of named entities in monolingual or multilingual natural language text. The workshop scope includes many specific research areas pertaining to named entities, such as orthographic and phonetic characteristics, corpus analysis, unsupervised and supervised named entity extraction in monolingual or multilingual corpora, transliteration modelling, and evaluation methodologies, to name a few.
For this year's edition, 11 research papers were submitted, each of which was reviewed by at least 3 reviewers from the program committee. Seven papers were chosen for publication, covering the main research areas, from named entity recognition, extraction and categorization, to distributional characteristics of named entities, and finally a novel evaluation metric for co-reference resolution. All accepted research papers are published in the workshop proceedings.
This year, as part of the NEWS workshop, we organized two shared tasks, one on Machine Transliteration Generation and another on Machine Transliteration Mining. Research teams from around the world participated, including teams from industry, government laboratories and academia. The transliteration generation task was introduced in NEWS 2009. While the focus of the 2009 shared task was on establishing quality metrics and on baselining transliteration quality against those metrics, the 2010 shared task expanded the scope of the transliteration generation task to about a dozen languages and explored how quality depends on the direction of transliteration between the languages. We collected large, hand-crafted parallel named entity corpora in a dozen different languages from 8 language families and made them available as a common dataset for the shared task. We published the details of the shared task, together with the training and development data, six months ahead of the conference, and the task attracted an overwhelming response from the research community. In total, 7 teams participated in the transliteration generation task. The approaches ranged from traditional learning methods (such as phrasal SMT-based methods and Conditional Random Fields) to somewhat unique approaches (such as the DirectTL approach), combined with several model combinations for re-ranking results. A report on the shared task that summarizes all submissions and the original whitepaper are also included in the proceedings and will be presented at the workshop. The participants in the shared task were asked to submit short system papers (4 pages each) describing their approach, and each such paper was reviewed by at least two members of the program committee to help improve the quality of its content and presentation. Six of them were finally accepted for publication in the workshop proceedings (one participating team did not submit its system paper in time).
NEWS 2010 also features a second shared task this year, on Transliteration Mining, which focuses specifically on mining transliterations from a commonly available resource, Wikipedia titles. The objective of this shared task is to identify transliterations from linked Wikipedia titles between English and another language written in a non-Latin script. Five teams participated in the mining task, each in multiple languages. The shared task was conducted in 5 language pairs, and the paired Wikipedia titles between English and each of the languages were provided to the participants. The output of the participating systems was measured using three specific metrics. All results are reported in the shared task report.
We hope that NEWS 2010 will provide an exciting and productive forum for researchers working in this research area. The technical programme includes 7 research papers and 9 system papers (3 as oral papers and 6 as poster papers) to be presented at the workshop. Further, we are pleased to have Dr Dan Roth, Professor at the University of Illinois and the Beckman Institute, delivering the keynote speech at the workshop.