A Large-Scale Query Spelling Correction Corpus

We present a new large-scale corpus of 54,772 queries with manually annotated spelling corrections. For 9,170 of the queries (16.74%), the annotations propose spelling variants that differ from the original query. Our corpus is an order of magnitude larger than other publicly available query spelling corpora. In addition to releasing the corpus, we provide an implementation of the winner of the 2011 Microsoft Speller Challenge and compare it, on the publicly available corpora, against spelling corrections mined from Google and Bing. This comparison also sheds some light on the spelling correction performance of state-of-the-art commercial search systems.
