Flexible Noisy Text Correction

We present a new, general, and language-independent approach to the noisy text correction problem, developed and implemented within the CULTURA project. We briefly describe the core candidate generator, REBELS; the overall system concept; its efficient implementation based on functional automata; and its immediate applications. The quality of the whole system is established empirically in several experimental settings that vary the language and the source of noise.
