Automatic and unsupervised methods in natural language processing

Natural language processing (NLP) is the computer-aided processing of language produced by humans. Human language is inherently irregular, and the most reliable results are obtained when a human is involved in at least some part of the processing. However, manual work is time-consuming and expensive. This thesis focuses on what can be accomplished in NLP when manual work is kept to a minimum. We describe the construction of two tools that greatly simplify the implementation of automatic evaluation. They are used to implement several supervised, semi-supervised and unsupervised evaluations by introducing artificial spelling errors. We also describe the design of a rule-based shallow parser for Swedish called GTA and a detection algorithm for context-sensitive spelling errors based on semi-supervised learning, called ProbCheck.

In the second part of the thesis, we first implement a supervised evaluation scheme that uses an error-free treebank to determine the robustness of a parser when faced with noisy input such as spelling errors. We evaluate the GTA parser and determine the robustness of the individual components of the parser as well as the robustness for different phrase types. Second, we create an unsupervised evaluation procedure for parser robustness. The procedure allows us to evaluate the robustness of parsers using different parser formalisms on the same text and compare their performance. Five parsers and one tagger are evaluated. For four of these, we have access to annotated material and can verify the estimates given by the unsupervised evaluation procedure. With few exceptions, the estimates turned out to be very accurate, so the robustness of an NLP system can be reliably established without any manual work. Third, we implement an unsupervised evaluation scheme for spell checkers. Using this, we perform a very detailed analysis of three spell checkers for Swedish. Last, we evaluate the ProbCheck algorithm. Two methods are included for comparison: a full parser and a method using tagger transition probabilities. The algorithm obtains results superior to the comparison methods. The algorithm is also evaluated on authentic data in combination with a grammar and spell checker.
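
The idea of evaluating robustness by introducing artificial spelling errors can be pictured with a minimal Python sketch. It is not the thesis's tools or its exact procedure. It assumes one Damerau-type edit (insert, delete, substitute or transpose) per affected word, and it measures how often a system's output on the noisy text agrees with its output on the clean text. The names `misspell`, `add_noise`, `agreement` and `toy_tagger`, and the `ALPHABET` constant, are hypothetical and chosen for illustration only.

```python
import random

# Assumed alphabet for illustration (lowercase letters incl. Swedish å, ä, ö).
ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"


def misspell(word, rng):
    """Apply one random Damerau-type edit to a word of length >= 2."""
    if len(word) < 2:
        return word  # too short to edit meaningfully; left unchanged
    op = rng.choice(["insert", "delete", "substitute", "transpose"])
    i = rng.randrange(len(word) - 1)  # position 0 .. len(word) - 2
    if op == "insert":
        return word[:i] + rng.choice(ALPHABET) + word[i:]
    if op == "delete":
        return word[:i] + word[i + 1:]
    if op == "substitute":
        # Pick a different letter so the character really changes.
        c = rng.choice([a for a in ALPHABET if a != word[i]])
        return word[:i] + c + word[i + 1:]
    # Transpose two adjacent characters (a no-op if they are equal).
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]


def add_noise(tokens, error_rate, seed=0):
    """Misspell each token independently with probability error_rate."""
    rng = random.Random(seed)
    return [misspell(t, rng) if rng.random() < error_rate else t
            for t in tokens]


def agreement(clean_labels, noisy_labels):
    """Fraction of positions where the two label sequences agree."""
    if len(clean_labels) != len(noisy_labels):
        raise ValueError("label sequences must have the same length")
    if not clean_labels:
        return 1.0
    same = sum(a == b for a, b in zip(clean_labels, noisy_labels))
    return same / len(clean_labels)


def toy_tagger(tokens):
    """Hypothetical stand-in for a real tagger: labels by last character."""
    return ["V" if t.endswith("r") else "X" for t in tokens]


if __name__ == "__main__":
    tokens = "hon läser en bok om språk".split()
    for rate in (0.0, 0.1, 0.2):
        noisy = add_noise(tokens, rate, seed=1)
        print(rate, agreement(toy_tagger(tokens), toy_tagger(noisy)))
```

Because each edit keeps the token count fixed, the outputs on clean and noisy text can be compared position by position. Replacing `toy_tagger` with a real tagger or parser, and sweeping the error rate, gives a degradation curve without any annotated material. When a treebank is available, the same comparison can instead be made against the gold annotation.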
