SEARCHING THE LONG TAIL: HIDDEN STRUCTURE IN SOCIAL TAGGING

In this paper we explore a method of decomposition of compound tags found in social tagging systems and outline several results, including improvement of search indexes, extraction of semantic information, and benefits to usability. Analysis of tagging habits demonstrates that social tagging systems such as del.icio.us and flickr include both formal metadata, such as geotags, and informally created metadata, such as annotations and descriptions. The majority of tags represent informal metadata; that is, they are not structured according to a formal model, nor do they correspond to a formal ontology. Statistical exploration of the main tag corpus demonstrates that such searches use only a subset of the available tags; for example, many tags are composed as ad hoc compounds of terms. In order to improve accuracy of searching across the data contained within these tags, a method must be employed to decompose compounds in such a way that there is a high degree of confidence in the result. An approach to decomposition of English-language compounds, designed for use within a small initial sample tagset, is described. Possible decompositions are identified from a generous wordlist, subject to selective lexicon snipping. In order to identify the most likely, a Bayesian classifier is used across term elements. To compensate for the limited sample set, a word classifier is employed and the results classified using a similar method, resulting in a successful classification rate of 88%, and a false negative rate of only 1%.