Web Text Corpus for Natural Language Processing

Web text has been used successfully as training data for many NLP applications. Whereas most previous work accesses web text through search engine hit counts, we built a Web Corpus by downloading web pages, producing a topic-diverse collection of 10 billion words of English. We show that for context-sensitive spelling correction, the Web Corpus gives better results than search engine hit counts. For thesaurus extraction, it achieves overall results similar to those of a corpus of newspaper text. Because many more words are available on the web, better results can be obtained by collecting much larger web corpora.
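
To make the spelling-correction task concrete, here is a minimal sketch of how counts from a locally stored corpus can choose between the members of a confusion set such as {piece, peace}. It is an illustration only and is not the method evaluated in this work: the toy corpus, the trigram context, and the names `context_counts` and `choose` are all assumptions introduced for the example.

```python
from collections import Counter


def context_counts(tokens):
    """Count every (left, word, right) trigram in a tokenized corpus."""
    counts = Counter()
    for i in range(1, len(tokens) - 1):
        counts[(tokens[i - 1], tokens[i], tokens[i + 1])] += 1
    return counts


def choose(confusion_set, left, right, counts):
    """Return the candidate seen most often between `left` and `right`.

    Ties (including all-zero counts) resolve to the first candidate listed.
    """
    return max(confusion_set, key=lambda w: counts[(left, w, right)])


# Hypothetical toy corpus standing in for downloaded web text.
corpus = (
    "i would like a piece of cake . "
    "we hope for peace in europe . "
    "a piece of pie"
).split()

counts = context_counts(corpus)
best = choose(["piece", "peace"], "a", "of", counts)
```

With a local corpus the counts are exact and any context pattern can be queried, whereas search engine hit counts are page-level estimates restricted to the queries the engine supports.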
