Paper information - Rapid resource transfer for multilingual natural language processing


Until recently, the focus of the Natural Language Processing (NLP) community has been on a handful of mostly European languages. However, the rapid changes taking place in the economic and political climate of the world precipitate a similar change in the relative importance given to various languages. The importance of rapidly acquiring NLP resources and computational capabilities in new languages is widely accepted. Statistical NLP models have a distinct advantage over rule-based methods in achieving this goal since they require far less manual labor. However, statistical methods require two fundamental resources for training: (1) online corpora, and (2) manual annotations. Creating these two resources can be as difficult as porting rule-based methods. This thesis demonstrates the feasibility of acquiring both corpora and annotations by exploiting existing resources for well-studied languages. Basic resources for new languages can be acquired in a rapid and cost-effective manner by utilizing existing resources cross-lingually. Currently, the most viable method of obtaining online corpora is converting existing printed text into electronic form using Optical Character Recognition (OCR). Unfortunately, a language that lacks online corpora most likely lacks OCR as well. We tackle this problem by taking an existing OCR system that was designed for a specific language and using that OCR system for a language with a similar script. We present a generative OCR model that allows us to post-process output from a non-native OCR system to achieve accuracy close to, or better than, a native one. Furthermore, we show that the performance of a native or trained OCR system can be improved by the same method. Next, we demonstrate cross-utilization of annotations on treebanks. We present an algorithm that projects dependency trees across parallel corpora. We also show that a reasonable quality treebank can be generated by combining projection with a small amount of language-specific post-processing. The projected treebank allows us to train a parser that performs comparably to a parser trained on manually generated data.

Philip Resnik | Okan Kolak
