Kimatu, a tool for cleaning non-content text parts from HTML docs
This paper describes Kimatu, a tool that extracts authentic content text from HTML docs, a necessary step for removing linguistically uninteresting text parts. The system’s algorithm is a multi-step bootstrapping process based on several heuristics. First, it identifies text blocks that have the same appearance and calculates, for each block, a ratio that linearly combines various other ratios useful for measuring content-richness. It then uses the blocks with high ratios as references and sequentially applies further heuristics to detect the remaining candidates and to reject clearly non-relevant blocks.
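The first step, scoring each block with a linear combination of ratios and keeping the high-scoring blocks as references, can be illustrated with a minimal Python sketch. The abstract does not say which ratios, weights, or threshold Kimatu uses, so the three ratios below, the values in `WEIGHTS`, the `threshold` default, and the names `Block`, `ratios`, `score`, and `reference_blocks` are all assumptions made for illustration, not the tool's actual features or parameters.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """One text block taken from an HTML page (hypothetical structure)."""
    text: str          # visible text of the block
    link_chars: int    # characters of the text that sit inside links
    markup_chars: int  # characters of HTML markup surrounding the text


# Assumed weights, not Kimatu's; they sum to 1 so the score stays in [0, 1].
WEIGHTS = {"non_link": 0.4, "text_density": 0.3, "long_words": 0.3}


def ratios(block):
    """Return illustrative content-richness ratios, each in [0, 1]."""
    text_len = len(block.text)
    words = block.text.split()
    if text_len == 0 or not words:
        return {name: 0.0 for name in WEIGHTS}
    return {
        # Share of the text that is not link text.
        "non_link": 1.0 - min(block.link_chars, text_len) / text_len,
        # Share of the block's characters that are text rather than markup.
        "text_density": text_len / (text_len + block.markup_chars),
        # Share of tokens longer than three characters.
        "long_words": sum(len(w) > 3 for w in words) / len(words),
    }


def score(block):
    """Linearly combine the ratios into one content-richness score."""
    r = ratios(block)
    return sum(WEIGHTS[name] * r[name] for name in WEIGHTS)


def reference_blocks(blocks, threshold=0.7):
    """Keep blocks whose score reaches the (assumed) threshold."""
    return [b for b in blocks if score(b) >= threshold]
```

Under these assumptions, a block of running prose with little markup scores higher than a navigation bar made mostly of link text, so only the prose block would be kept as a reference for the later heuristics.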