Text Mining Using N-Grams

Text mining is the art of turning free text into numerical variables and then analyzing them with statistical techniques. We introduce the Stata command ngram, which implements the most common approach to text mining, the "bag of words". An n-gram is a contiguous sequence of n words in a text. Broadly speaking, ngram creates hundreds or thousands of variables, each recording how often the corresponding n-gram occurs in a given text. This is more useful than it sounds. We illustrate ngram by categorizing text answers to two open-ended questions.
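
To make the bag-of-words idea concrete, the following minimal Python sketch turns a few short texts into n-gram count variables. It is an illustration of the general technique only, not the implementation of the Stata ngram command; the function names (ngrams, bag_of_ngrams), the whitespace tokenization, and the toy texts are all assumptions made for this example.

```python
from collections import Counter


def ngrams(text, n):
    """Return the contiguous n-word sequences in text.

    Assumption: words are lowercased and split on whitespace only.
    """
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def bag_of_ngrams(texts, max_n=2):
    """Count every n-gram of length 1..max_n in each text.

    Returns (vocab, matrix): vocab is the sorted list of all n-grams seen,
    and matrix[i][j] is how often vocab[j] occurs in texts[i].
    """
    counts = [
        Counter(g for n in range(1, max_n + 1) for g in ngrams(t, n))
        for t in texts
    ]
    vocab = sorted(set().union(*counts))
    matrix = [[c[g] for g in vocab] for c in counts]
    return vocab, matrix


# Hypothetical toy data: two short "answers".
texts = ["the cat sat", "the cat saw the cat"]
vocab, matrix = bag_of_ngrams(texts)

# Each column is one n-gram variable; each row is one text.
for gram, column in zip(vocab, zip(*matrix)):
    print(f"{gram!r}: {list(column)}")
```

Each n-gram becomes one column, so even short texts produce many variables. A practical tool would typically also drop rare n-grams and very common words before analysis; those steps are omitted here.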