Using suffix arrays as language models: Scaling the n-gram

In this article, we propose using suffix arrays to implement n-gram language models in which n is practically unbounded. We call these unbounded n-grams ∞-grams. The approach makes it efficient to use large contexts to distinguish between alternative sequences while applying synchronous back-off. We have applied it to three practical tasks: spelling confusibles, verb and noun agreement, and prenominal adjective ordering. These initial experiments show promising results, and we relate performance to the size of the n-grams used for disambiguation.
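To make the idea concrete, the following is a minimal sketch of counting n-grams of arbitrary length with a suffix array over a tokenized corpus. It is an illustration under stated assumptions, not the implementation used in the article: the function names (`build_suffix_array`, `count_ngram`, `choose`) are hypothetical, the naive construction is only suitable for small corpora, and `choose` encodes one assumed reading of synchronous back-off, in which the context is shortened for all alternatives together until at least one of them is attested.

```python
def build_suffix_array(tokens):
    """Start positions of all suffixes, sorted by the token sequence they begin.

    Naive construction (copies suffixes while sorting); fine for a sketch,
    too slow and memory-hungry for a large corpus.
    """
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def _bound(tokens, sa, ngram, upper):
    """Binary search for the first suffix whose n-token prefix is >= ngram
    (upper=False) or > ngram (upper=True)."""
    n = len(ngram)
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        prefix = tokens[sa[mid]:sa[mid] + n]
        if prefix < ngram or (upper and prefix == ngram):
            lo = mid + 1
        else:
            hi = mid
    return lo


def count_ngram(tokens, sa, ngram):
    """Number of occurrences of ngram (a list of tokens) of any length."""
    return _bound(tokens, sa, ngram, True) - _bound(tokens, sa, ngram, False)


def choose(tokens, sa, left_context, alternatives):
    """Pick the alternative with the highest count in the longest shared context.

    Assumption: the context is shortened for all alternatives at the same
    time until at least one alternative has a non-zero count.
    Returns (alternative, n-gram size used), or (None, 0) if nothing matches.
    """
    for start in range(len(left_context) + 1):
        context = left_context[start:]
        counts = {alt: count_ngram(tokens, sa, context + [alt])
                  for alt in alternatives}
        if any(counts.values()):
            return max(counts, key=counts.get), len(context) + 1
    return None, 0


corpus = ("i would rather walk than drive . then we went home . "
          "rather than wait , then we left").split()
sa = build_suffix_array(corpus)
print(count_ngram(corpus, sa, ["then", "we"]))                    # 2
print(choose(corpus, sa, ["would", "rather"], ["than", "then"]))  # ('than', 2)
```

Because all suffixes that begin with a given n-gram sit next to each other in the sorted array, two binary searches give the count for any n without fixing n in advance. The sketch uses only the left context; how the context window is placed around the target word is a design choice that this example does not take from the article.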
