Determining Unintelligible Words from their Textual Contexts

Abstract: We propose a method for determining unintelligible words from their textual context. Because the unintelligible word may be any of a large number of candidates, a robust, large-scale method is needed. The large scale makes the problem sensitive to spurious similarities of contexts, that is, cases where the contexts of two different words are similar. To reduce this effect, we induce structured sparsity on the words by formulating the task as a group Lasso problem. We compare this formulation with approaches based on k-nearest neighbors and on support vector machines, and find that group Lasso outperforms both by a large margin. We achieve up to 75% accuracy when determining the word from among 1000 candidate words, both on the Brown corpus and on the British National Corpus. Unintelligible words are often the result of errors in Optical Character Recognition (OCR) algorithms. Because the proposed method uses information independent of the information used in OCR, we expect that a combined approach could be very successful, as OCR and the proposed method complement each other.
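To make the group Lasso idea concrete, the following is a minimal sketch, not the formulation used in the paper, which the abstract does not spell out. It assumes that each candidate word owns a group of known context vectors (the columns of a hypothetical matrix `D`), that the context `x` of the unintelligible word is reconstructed as a sparse combination of these columns, and that the word whose group carries the largest coefficient norm is chosen. The solver is a generic proximal gradient loop; the names `group_lasso`, `predict_word`, `groups`, and `lam`, as well as the toy data, are illustrative assumptions.

```python
import numpy as np

def group_soft_threshold(v, threshold):
    """Block soft-thresholding: shrink the whole group v toward zero."""
    norm = np.linalg.norm(v)
    if norm <= threshold:
        return np.zeros_like(v)
    return (1.0 - threshold / norm) * v

def group_lasso(D, x, groups, lam, n_iter=500):
    """Minimize 0.5 * ||x - D a||^2 + lam * sum_g ||a_g||_2 by proximal gradient."""
    a = np.zeros(D.shape[1])
    # Step size 1/L, where L is the squared largest singular value of D.
    step = 1.0 / (np.linalg.norm(D, 2) ** 2)
    for _ in range(n_iter):
        z = a - step * (D.T @ (D @ a - x))  # gradient step on the squared error
        for idx in groups.values():         # proximal step, one group at a time
            a[idx] = group_soft_threshold(z[idx], step * lam)
    return a

def predict_word(D, x, groups, lam=0.1):
    """Return the candidate word whose coefficient group has the largest norm."""
    a = group_lasso(D, x, groups, lam)
    return max(groups, key=lambda w: np.linalg.norm(a[groups[w]]))

# Toy data (assumed): rows are context features, columns are observed contexts.
D = np.array([
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
])
groups = {"bank": [0, 1], "river": [2, 3]}  # column indices owned by each word
x = D[:, 0] + D[:, 1]                       # a context built from "bank" contexts

print(predict_word(D, x, groups))
```

The group penalty zeroes out entire words rather than individual contexts, which is the structured sparsity the abstract refers to: a word is either selected with its whole group of contexts or discarded.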
