Improving Short Text Classification Using Public Search Engines

In Web 2.0 applications, many of the texts provided by users are only 3 to 10 words long. Accurate classification of these short texts can help readers find the messages they need more quickly. In this paper, we propose a method that expands short texts with the help of public search engines in two steps. First, we submit the short text as a query to a public search engine and crawl the result pages. Second, we treat the text in the result pages as background knowledge for the original short text and extract the feature vector from it. Because we can choose how many result pages to use, we can obtain enough text for feature vector extraction to address the data sparseness problem. We conducted experiments under different settings, and the empirical results indicate that this enriched representation of short texts can substantially improve classification performance.
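The two-step expansion can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions, not the implementation used in the paper: the names `expand_short_text`, `fetch_result_pages`, and `num_pages` are hypothetical, the search engine is replaced by a caller-supplied function that returns already-crawled page texts, and normalized term frequencies stand in for whatever feature weighting the method actually uses.

```python
import re
from collections import Counter
from typing import Callable, Dict, List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def expand_short_text(
    short_text: str,
    fetch_result_pages: Callable[[str], List[str]],
    num_pages: int = 5,
) -> Dict[str, float]:
    """Build a feature vector for a short text from search result pages.

    fetch_result_pages is a hypothetical stand-in for "query a public
    search engine and crawl the result pages"; it must return the plain
    text of each result page, best match first.
    """
    # Step 1: query the search engine and keep the top num_pages results.
    pages = fetch_result_pages(short_text)[:num_pages]

    # Step 2: treat the page texts as background knowledge and count terms.
    counts: Counter = Counter()
    for page in pages:
        counts.update(tokenize(page))

    # Assumption: normalized term frequency as the feature weight.
    total = sum(counts.values())
    if total == 0:
        return {}
    return {term: n / total for term, n in counts.items()}


def fake_search(query: str) -> List[str]:
    """Canned results so the sketch runs without network access."""
    return [
        "apple iphone review battery camera",
        "iphone camera test",
        "unrelated page",
    ]


if __name__ == "__main__":
    print(expand_short_text("iphone camera", fake_search, num_pages=2))
```

Raising `num_pages` adds more background text per short text, which is the knob the abstract refers to when it speaks of choosing a proper number of result pages; the resulting dictionary can be passed to any standard classifier that accepts sparse term features.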