Topical anomaly detection from Twitter stream

In this paper, we spot topically anomalous tweets in twitter streams by analyzing the content of the document pointed to by the URLs in the tweets in preference to their textual content. Existing approaches to anomaly detection ignore such URLs thereby missing opportunities to detect off-topic tweets. Specifically, we determine the divergence of claimed topic of a tweet as reflected by the hashtags and the actual topic as reflected by the referenced document content. Our approach avoids the need for labeled samples by selecting documents from reliable sources gleaned from the URLs present in the tweets. These documents are used for comparison against documents associated with unknown URLs in incoming tweets improving reliability, scalability and adaptability to rapidly changing topics. We evaluate our approach on three events and show that it can find topical inconsistencies not detectable by existing approaches.