Shared Task Organizing Committee -transliteration Mining: Whitepaper of News 2010 Shared Task on Transliteration Generation Transliteration Generation and Mining with Limited Training Resources Transliteration Using a Phrase-based Statistical Machine Translation System to Re-score the Output of a Jo

Preface

Named Entities play a significant role in Natural Language Processing and Information Retrieval. While identifying and analyzing named entities in a given natural language is a challenging research problem by itself, the phenomenal growth in the Internet user population, especially among the non-English speaking parts of the world, has extended this problem to the crosslingual arena. We specifically focus …

The purpose of the NEWS workshop is to bring together researchers across the world interested in the identification, analysis, extraction, mining and transformation of named entities in monolingual or multilingual natural language text. The workshop scope includes many specific research areas pertaining to named entities, such as orthographic and phonetic characteristics, corpus analysis, unsupervised and supervised named entity extraction in monolingual or multilingual corpora, transliteration modelling, and evaluation methodologies, to name a few.

For this year's edition, 11 research papers were submitted, each of which was reviewed by at least 3 reviewers from the program committee. 7 papers were chosen for publication, covering the main research areas, from named entity recognition, extraction and categorization, to distributional characteristics of named entities, and finally novel evaluation metrics for co-reference resolution. All accepted research papers are published in the workshop proceedings.

This year, as part of the NEWS workshop, we organized two shared tasks: one on Machine Transliteration Generation, and another on Machine Transliteration Mining. Research teams from around the world participated, including industry, government laboratories and academia.

The transliteration generation task was introduced in NEWS 2009. While the focus of the 2009 shared task was on establishing the quality metrics and on baselining the transliteration quality based on those metrics, the 2010 shared task expanded the scope of the transliteration generation task to about a dozen languages, and explored how quality depends on the direction of transliteration between the languages. We collected large, hand-crafted parallel named entity corpora in a dozen different languages from 8 language families, and made them available as a common dataset for the shared task. We published the details of the shared task and the training and development data six months ahead of the conference, which attracted an overwhelming response from the research community. In total, 7 teams participated in the transliteration generation task. The approaches ranged from traditional unsupervised learning methods (such as Phrasal SMT-based methods, Conditional Random Fields, etc.) to somewhat unique approaches (such as the DirectTL approach), combined with several model combinations for re-ranking the results. A report of …
