Web Content Extraction through Histogram Clustering

We describe a method for extracting content text from diverse Web pages using the HTML document’s Text-To-Tag Ratio (TTR) rather than specific HTML cues, which are not consistent across Web pages. We describe how to compute the TTR on a line-by-line basis and then cluster the results into content and non-content areas. The resulting TTR-histogram is not easily clustered because it is one-dimensional; we therefore present a technique for representing the histogram in two dimensions. Next, we compare clustering techniques such as EM, K-Means, and Farthest First, in both density and distance modes, with a threshold partitioning technique on the resulting two-dimensional data. These clustering techniques are further enhanced with histogram smoothing. We then evaluate our approach using standard accuracy, precision, and recall metrics.

INTRODUCTION

The amount of information being gathered and stored on the Internet continues to increase. The artifacts of this growing market provide interesting new research opportunities that explore social interactions, language, art, mathematics, etc. Many of these opportunities require the content of the Internet to be gathered, processed, and stored quickly and efficiently. This effort is often hampered by the use of structure tags in HTML and XML. These tags are meaningful only to the browser that renders the document and bear little semantic meaning for the end user. Tags and other non-content HTML characters (images not included) comprise the majority of each page’s size (Lu et al. 2004), and yet Internet researchers are forced to crawl, compute, and store Web content in its entirety. This work focuses on extracting content from Web pages that are otherwise laden with structural data, links, and advertisements, a task commonly called Text Extraction (Soderland 1997).
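
For concreteness, the line-by-line Text-To-Tag Ratio mentioned in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation evaluated in this paper: the function names `text_to_tag_ratios` and `above_mean` are hypothetical, tags are matched with a simple regular expression rather than a full HTML parser, a line with no tags is assumed to take its text length as its ratio, and the mean threshold is only a naive stand-in for the clustering and threshold partitioning techniques compared later.

```python
import re
from typing import List

# Assumption: anything between '<' and '>' on a single line counts as one tag.
TAG_RE = re.compile(r"<[^>]*>")


def text_to_tag_ratios(html: str) -> List[float]:
    """Return one text-to-tag ratio per non-blank line of the HTML source."""
    ratios = []
    for line in html.splitlines():
        if not line.strip():
            continue  # blank lines carry neither text nor tags
        tag_count = len(TAG_RE.findall(line))
        text_length = len(TAG_RE.sub("", line).strip())
        # Assumption: with zero tags, use the text length itself
        # so that the ratio is defined without dividing by zero.
        if tag_count:
            ratios.append(text_length / tag_count)
        else:
            ratios.append(float(text_length))
    return ratios


def above_mean(ratios: List[float]) -> List[bool]:
    """Flag lines whose ratio exceeds the mean ratio (naive stand-in threshold)."""
    mean = sum(ratios) / len(ratios) if ratios else 0.0
    return [ratio > mean for ratio in ratios]


if __name__ == "__main__":
    page = (
        "<div><a href='#'>Home</a></div>\n"
        "<p>This is a long sentence of real article content.</p>\n"
        "<br><br>"
    )
    ratios = text_to_tag_ratios(page)
    for ratio, is_content in zip(ratios, above_mean(ratios)):
        print(f"{ratio:6.2f}  {'content' if is_content else 'non-content'}")
```

In this toy page, the navigation line and the line of bare `<br>` tags have low ratios, while the paragraph line has a high one. A real page would need multi-line tags, scripts, and comments handled before counting.
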
This work is particularly challenging because it is difficult to determine which parts of a Web page are meaningful and which are not. In this paper, we extend our previous work on Web content extraction with the Text-To-Tag Ratio (TTR). The TTR approach makes no assumptions about the particular structure of a given Web page, nor does it look for particular cues such as specific HTML tags, as previous research does. The only necessary precondition is that the page has some structure. With this in mind, we construct a TTR-array with the contention that, for each line k in the array, the higher the TTR of element k relative to the mean TTR of the entire array, the more likely it is that k represents a line of content text within the HTML document. In this and in previous work (Weninger et al. 2008), we observe that the TTR-array closely resembles a histogram, in that each histogram bucket represents the TTR of a line