Unsupervised Anomaly Detection

This paper describes work on the detection of anomalous material in text. We show several variants of an automatic technique for identifying an 'unusual' segment within a document, and consider texts which are unusual because of author, genre [Biber, 1998], topic or emotional tone. We evaluate the technique using many experiments over large document collections, created to contain randomly inserted anomalous segments. In order to successfully identify anomalies in text, we define more than 200 stylistic features to characterize writing, some of which are well-established stylistic determiners, but many of which are novel. Using these features with each of our methods, we examine the effect of segment size on our ability to detect anomaly, allowing segments of size 100 words, 500 words and 1000 words. We show substantial improvements over a baseline in all cases for all methods, and identify the method variant which performs consistently better than others.