Instance-Based Learning for Tweet Monitoring and Categorization

The CLEF RepLab 2014 Track gave us the occasion to investigate the robustness of instance-based learning in a complete system for tweet monitoring and categorization. The algorithm we implemented was k-Nearest Neighbors (k-NN). Across both domains (automotive and banking) and both languages (English and Spanish), the experiments showed that the categorizer was not affected by the choice of representation: even with all training tweets merged into a single Knowledge Base (KB), the observed performance was close to that obtained with dedicated KBs. Interestingly, adding English training data to the sparse Spanish data was useful for Spanish categorization (+14% accuracy for automotive, +26% for banking). Yet performance suffered from overprediction of the most prevalent category. The algorithm showed the defects of its virtues: it was very robust, but not easy to improve. BiTeM/SIBtex tools for tweet monitoring are available on the DrugsListener Project page of the BiTeM website, http://bitem.hesge.ch/.
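
To make the instance-based approach concrete, the following is a minimal sketch of k-NN categorization over a single merged KB. It is not the system's actual implementation: the bag-of-words representation, the cosine similarity, the unweighted majority vote, the function names, the category labels, and the toy tweets are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: plain lowercase whitespace tokenization.
    return text.lower().split()


def cosine(a, b):
    # Cosine similarity between two term-frequency Counters.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_categorize(tweet, kb, k=3):
    # kb is a list of (text, label) pairs; all domains and languages
    # can be merged into this one list.
    query = Counter(tokenize(tweet))
    ranked = sorted(
        kb,
        key=lambda entry: cosine(query, Counter(tokenize(entry[0]))),
        reverse=True,
    )
    # Unweighted majority vote among the k most similar tweets.
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy KB mixing domains and languages (invented data).
KB = [
    ("new car model engine review", "products"),
    ("car engine recall safety", "products"),
    ("bank quarterly profit report", "performance"),
    ("bank profit falls this quarter", "performance"),
    ("nuevo coche motor prueba", "products"),
]

print(knn_categorize("review of the new engine", KB))
print(knn_categorize("bank profit", KB))
```

With an unweighted vote of this kind, neighbors with little or no similarity still count fully, which is one way a prevalent category can come to dominate predictions.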