Mining user-generated geographic content: an interactive, crowdsourced approach to validation and supervision

This paper describes a pilot study that implements a novel approach to validating data mining tasks by using the crowd to train a classifier. This hybrid processing approach successfully addresses challenges faced in human curation or machine processing of user-generated geographic content (UGGC), namely quality control, reproducibility, sustainability, scaling, data quality, overfitting, and training costs. We test the approach by mining UGGC to derive information on local places as humans perceive them. Specifically, we retrieve Flickr image metadata, enrich it semantically by building term vectors from a controlled vocabulary, and cluster it spatially. Online participants then rate those clusters, which we classify into noise and places using both semantic and cluster characteristics. Participants supervise the classification by annotating the results, and we use their feedback to improve the clustering and revise the trained model. The results show that the approach is feasible and suggest future studies to improve it, while also indicating that mining places from UGGC requires more than a single source.
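The machine side of the pipeline described above can be illustrated with a minimal sketch. It is not the authors' implementation: the vocabulary, the record fields (`lat`, `lon`, `user`, `tags`), the grid-cell grouping used in place of the paper's spatial clustering, and the rule-based `label_cluster` used in place of a trained classifier are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

# Hypothetical controlled vocabulary; the paper's actual vocabulary is not given here.
VOCABULARY = ("park", "lake", "church", "market")


def term_vector(tags, vocabulary=VOCABULARY):
    """Count how often each vocabulary term occurs among a photo's tags."""
    counts = Counter(tag.lower() for tag in tags)
    return [counts[term] for term in vocabulary]


def grid_clusters(photos, cell_deg=0.01):
    """Group photos by grid cell (an assumed stand-in for spatial clustering)."""
    cells = defaultdict(list)
    for photo in photos:
        key = (math.floor(photo["lat"] / cell_deg),
               math.floor(photo["lon"] / cell_deg))
        cells[key].append(photo)
    return list(cells.values())


def cluster_features(cluster):
    """Combine semantic and cluster characteristics into one feature dict."""
    vectors = [term_vector(photo["tags"]) for photo in cluster]
    return {
        "n_photos": len(cluster),
        "n_users": len({photo["user"] for photo in cluster}),
        "term_counts": [sum(column) for column in zip(*vectors)],
    }


def label_cluster(features, min_users=2):
    """Toy rule in place of a trained classifier: 'place' or 'noise'."""
    has_terms = any(features["term_counts"])
    if has_terms and features["n_users"] >= min_users:
        return "place"
    return "noise"


# Made-up sample records, only to exercise the functions.
photos = [
    {"lat": 47.3712, "lon": 8.5412, "user": "a", "tags": ["Park", "sunset"]},
    {"lat": 47.3715, "lon": 8.5418, "user": "b", "tags": ["park", "lake"]},
    {"lat": 47.4051, "lon": 8.6033, "user": "c", "tags": ["selfie"]},
]
labels = [label_cluster(cluster_features(c)) for c in grid_clusters(photos)]
```

In a crowd-supervised setting, the labels produced at this stage would be shown to participants, and their annotations would serve as training data for revising the model; that feedback loop is outside the scope of this sketch.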