Twitter bot detection & categorization - a comparative study of machine learning methods
Automated Twitter accounts, or Twitter bots, have gained increased attention lately. In particular, a novel type of bot, the so-called social bot, is piquing people's interest, as these bots have recently been involved in a number of political events. Most previous work on bot detection has not distinguished between such novel bot types and other, less sophisticated ones. Instead, they have all been lumped together, producing models that detect automated behaviour in general. Although useful, this approach might cause issues when the task at hand concerns one particular type of bot account. This thesis therefore attempts to make the classification of bots more fine-grained, viewing social bots, traditional spambots, and fake followers as three separate bot categories, to be classified together with the category for actual human users (called "genuine users"). Four machine learning methods are trained and compared for this purpose, showing that the random forest performs slightly better than the rest in all performance measures used. However, all models yield an overall accuracy above 90%, which is relatively high compared to similar studies in the field. The analysis also indicates that data sampling has been biased, skewing the data to yield some unexpected results. For instance, genuine users show much more activity than would be expected of the average human-controlled Twitter account. Additionally, traditional bots, which are supposed to be the easiest to classify, instead appear to be the hardest. If the data sampling has indeed been biased, the validity of the models trained on this skewed data is called into question. Hence, more research into sampling techniques is suggested, and it is concluded that the models produced should be tested on more diverse datasets. Without these kinds of repeated studies, the impact of the supposed sampling bias, and consequently the usefulness of the models in real-world situations, cannot be properly assessed.