GoogleLing: The Web as a Linguistic Corpus

We describe software that transforms any search engine or searchable corpus into a tool for linguistic research with a rich query syntax. We provide support for case-sensitive searches, within-sentence and within-N-words match constraints, part-of-speech restrictions on words, and “smart” verb-ending inflection wildcards. The software generalizes the query for the underlying search engine, and then processes the resulting pages with a set of natural language processing tools to extract the matching sentences. Preliminary evaluation suggests that this greatly enhances linguists’ ability to use the web as a linguistic corpus.
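
The two-stage idea described above (generalize the query, then filter the returned text) might be sketched as follows. This is a minimal illustration and not the actual GoogleLing implementation: the function names (`generalize`, `split_sentences`, `matches`, `extract`), the naive regex-based sentence splitter and tokenizer, and the restriction to a two-word ordered query are all assumptions made for the example. Part-of-speech restrictions and inflection wildcards are omitted.

```python
import re


def generalize(terms):
    # Hypothetical relaxation step: build a plain keyword query for a
    # backend that ignores case and word distance.
    return " ".join(t.lower() for t in terms)


def split_sentences(text):
    # Naive splitter (an assumption): break after ., ! or ? plus whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    # Naive tokenizer (an assumption): runs of word characters.
    return re.findall(r"\w+", sentence)


def matches(sentence, first, second, max_gap, case_sensitive=True):
    # True if `second` occurs after `first` within `max_gap` tokens.
    tokens = tokenize(sentence)
    if not case_sensitive:
        tokens = [t.lower() for t in tokens]
        first, second = first.lower(), second.lower()
    pos_first = [i for i, t in enumerate(tokens) if t == first]
    pos_second = [i for i, t in enumerate(tokens) if t == second]
    return any(0 < j - i <= max_gap for i in pos_first for j in pos_second)


def extract(page_text, first, second, max_gap, case_sensitive=True):
    # Post-filter: keep only sentences that satisfy the strict constraints
    # the backend could not enforce.
    return [
        s for s in split_sentences(page_text)
        if matches(s, first, second, max_gap, case_sensitive)
    ]


page = (
    "Apple makes phones. "
    "I ate an apple and then made a pie. "
    "Apple quickly makes new phones."
)
backend_query = generalize(["Apple", "makes"])
hits = extract(page, "Apple", "makes", max_gap=2)
```

Here `page` stands in for text retrieved with `backend_query`; the strict case and distance constraints are applied only in the post-filtering step, so the backend needs nothing beyond plain keyword search.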