A Method for Analysing Large-Scale UGC Data for Tourism: Application to the Case of Catalonia

In recent years, many articles have studied user-generated content (UGC) data in the domains of tourism and hospitality, in particular through quantitative and qualitative content analysis of travel blogs and online travel reviews (OTRs). In general, researchers have worked with more or less population-representative samples of tens or hundreds of travel diaries, which are small enough to be processed manually. However, because the volume of this content has grown dramatically, especially in the case of hospitality OTRs, manual processing is no longer practical. This article therefore proposes a method for semi-automatic downloading, arranging, cleaning, debugging, and analysing large-scale travel blog and OTR data. The main goal is to classify the collected webpages by date and destination and to enable offline content analysis of the text as written by the author. The method is applied to about 85,000 diaries of tourists who visited Catalonia between 2004 and 2013, and significant results are obtained in terms of content analysis.
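As a rough illustration of the classification and content-analysis steps summarised above, the following Python sketch groups already-cleaned diary entries by destination and year and then counts word frequencies within one group. It is not the authors' implementation: the `Entry` fields, the function names, and the sample texts are all hypothetical, and the download and cleaning stages are assumed to have been completed beforehand.

```python
import re
from collections import Counter, defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass
class Entry:
    """One cleaned diary or review (hypothetical structure)."""
    destination: str
    visited: date
    text: str


def group_by_destination_and_year(entries):
    """Classify entries under (destination, year) keys."""
    groups = defaultdict(list)
    for entry in entries:
        groups[(entry.destination, entry.visited.year)].append(entry)
    return dict(groups)


def word_frequencies(entries, min_length=3):
    """Count lower-cased words of at least min_length letters."""
    counts = Counter()
    for entry in entries:
        # [^\W\d_]+ matches runs of letters, including accented ones.
        words = re.findall(r"[^\W\d_]+", entry.text.lower())
        counts.update(w for w in words if len(w) >= min_length)
    return counts


# Invented sample data, for illustration only.
sample = [
    Entry("Barcelona", date(2010, 5, 1), "The beach was crowded but lovely."),
    Entry("Barcelona", date(2010, 8, 3), "Beach walks and tapas every evening."),
    Entry("Girona", date(2012, 4, 9), "We walked along the old town walls."),
]

groups = group_by_destination_and_year(sample)
barcelona_2010 = word_frequencies(groups[("Barcelona", 2010)])
```

Keying the groups on a (destination, year) pair keeps each subset small enough to analyse independently; a real pipeline would likely also need stop-word removal and language detection, which are omitted here.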