Rapid resource transfer for multilingual natural language processing

Until recently, the Natural Language Processing (NLP) community focused on a handful of mostly European languages. However, rapid changes in the world's economic and political climate are shifting the relative importance given to various languages, and the need to rapidly acquire NLP resources and computational capabilities in new languages is now widely accepted. Statistical NLP models have a distinct advantage over rule-based methods in achieving this goal because they require far less manual labor. However, statistical methods require two fundamental resources for training: (1) online corpora and (2) manual annotations. Creating these two resources can be as difficult as porting rule-based methods. This thesis demonstrates the feasibility of acquiring both corpora and annotations by exploiting existing resources for well-studied languages: basic resources for new languages can be acquired rapidly and cost-effectively by using existing resources cross-lingually.

Currently, the most viable method of obtaining online corpora is converting existing printed text into electronic form using Optical Character Recognition (OCR). Unfortunately, a language that lacks online corpora most likely lacks OCR as well. We tackle this problem by taking an existing OCR system designed for a specific language and applying it to a language with a similar script. We present a generative OCR model that allows us to post-process output from a non-native OCR system to achieve accuracy close to, or better than, that of a native one. Furthermore, we show that the same method can improve the performance of a native or trained OCR system.

Next, we demonstrate cross-utilization of annotations on treebanks. We present an algorithm that projects dependency trees across parallel corpora. We also show that a treebank of reasonable quality can be generated by combining projection with a small amount of language-specific post-processing. The projected treebank allows us to train a parser that performs comparably to a parser trained on manually generated data.
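To make the idea of projecting dependency trees across parallel corpora concrete, the following is a minimal sketch, not the thesis's actual algorithm. It assumes a word alignment given as (source index, target index) pairs and copies a dependency only when both the dependent and its head have a one-to-one link. The function name `project_dependencies`, the `ROOT` sentinel, and the example sentence pair are illustrative assumptions. Unaligned and many-to-one cases are simply left unattached here; they are the kind of gap that language-specific post-processing would have to address.

```python
from typing import Dict, List, Optional, Tuple

ROOT = -1  # hypothetical sentinel marking the root of a dependency tree


def project_dependencies(
    src_heads: List[int],
    alignment: List[Tuple[int, int]],
    tgt_len: int,
) -> List[Optional[int]]:
    """Project source head indices onto target tokens via one-to-one links.

    src_heads[i] is the head index of source token i (ROOT for the root).
    Returns a list of length tgt_len: entry j is the projected head of
    target token j, ROOT for a projected root, or None if nothing projects.
    """
    # Count how often each source and target index appears in the alignment.
    src_count: Dict[int, int] = {}
    tgt_count: Dict[int, int] = {}
    for s, t in alignment:
        src_count[s] = src_count.get(s, 0) + 1
        tgt_count[t] = tgt_count.get(t, 0) + 1

    # Keep only one-to-one links (a simplifying assumption of this sketch).
    src_to_tgt = {
        s: t for s, t in alignment
        if src_count[s] == 1 and tgt_count[t] == 1
    }

    tgt_heads: List[Optional[int]] = [None] * tgt_len
    for dep, head in enumerate(src_heads):
        if dep not in src_to_tgt:
            continue  # dependent has no usable counterpart
        tgt_dep = src_to_tgt[dep]
        if head == ROOT:
            tgt_heads[tgt_dep] = ROOT
        elif head in src_to_tgt:
            tgt_heads[tgt_dep] = src_to_tgt[head]
    return tgt_heads


# English: "I saw red cars"   heads: I->saw, saw->ROOT, red->cars, cars->saw
# Spanish: "vi coches rojos"  links: saw-vi, cars-coches, red-rojos
english_heads = [1, ROOT, 3, 1]
links = [(1, 0), (3, 1), (2, 2)]
projected = project_dependencies(english_heads, links, tgt_len=3)
print(projected)
```

In this example the unaligned English pronoun is dropped, and the reordered noun and adjective still receive the heads implied by the English tree.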
