UALR at TREC: Blog Track

We consider Opinion Blog retrieval from classification point of view. We used the active learning method with an integrated feature selection to train the Support Vector Machine algorithm. We wanted to study the effect of different types of features on the classification accuracy of the model generated by the classifier algorithm. We considered mainly three different types of features for 5 runs submitted. Feature types include bag-of-words features, seed-words as features and statistical features. Bag-ofwords features are generated from the actual blog data. Seed-words were manually generated specific to the domain of interest. Statistical features studied included the ratio of linguistic features to total number of words. We built models using an iterative process and studied accuracy as well as coverage of each model. Study of different features is important in order to build a better model. Feature selection algorithms can choose the best features among the available ones but different features have costs associated with them. We need features that not only predict class labels or contribute towards prediction but the feature should also be representative of the entire dataset, especially test data. Training the classifier on such features will yield better coverage and training accuracy for the model. We compared the three different models generated by three different feature generation strategies. Our preliminary results indicate that seed-words that are specific to a particular domain or particular type of classification achieve better accuracy and coverage. In general, bag-of-words features are tightly coupled with the data they represent. On the other hand, statistical features are independent of the actual words used. Statistical features are more useful in building robust models that can be used with different languages and for different tasks.