RoadRunner: automatic data extraction from data-intensive web sites

Data extraction from HTML pages is performed by software modules usually called wrappers. Roughly speaking, a wrapper identifies and extracts relevant pieces of text inside a web page and reorganizes them in a more structured format. The literature describes a number of systems that (semi-)automatically generate wrappers for HTML pages [1]. We have recently investigated original approaches that aim to push the level of automation of the wrapper generation process further. Our main intuition is that, in a data-intensive web site, pages can be classified into a small number of classes, such that pages belonging to the same class share a rather tight structure. Based on this observation, we have studied a novel technique, which we call the matching technique [2], that automatically generates a common wrapper by exploiting similarities and differences among pages of the same class. In addition, to deal with the complexity and heterogeneity of real-life web sites, we have studied several complementary techniques that greatly enhance the effectiveness of matching.

Our demonstration presents RoadRunner, our prototype that implements matching and its companion techniques. We have conducted several experiments on pages from real-life web sites; these experiments have shown the effectiveness of the approach, as well as the efficiency of the system [2].

The matching technique for wrapper inference [2] is based on an iterative process. At every step, matching works on two objects at a time: (i) an input page (the sample), represented as a list of tokens, where each token is either a tag or a text field, and (ii) a wrapper, expressed as a regular expression. The process starts by taking one input page as the initial version of the wrapper; the wrapper is then matched against the sample and progressively refined to solve mismatches. A mismatch occurs when some token in the sample does not comply with the grammar specified by the wrapper. Mismatches can be solved by generalizing the wrapper. The process succeeds if a common wrapper can be generated by solving all mismatches encountered.
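
The iterative process described above can be illustrated with a minimal Python sketch. It is not the RoadRunner implementation: the names `tokenize`, `match`, `infer_wrapper` and the `#PCDATA` marker are hypothetical, attributes are dropped from tags, and the only mismatch it solves is one between two differing text fields, which it generalizes into a single field. Under these assumptions the wrapper is a degenerate regular expression (a plain concatenation of tokens); any mismatch involving a tag makes the sketch give up instead of generalizing further.

```python
from html.parser import HTMLParser

PCDATA = "#PCDATA"  # hypothetical marker for a generalized text field


class Tokenizer(HTMLParser):
    """Turn an HTML page into a flat list of (kind, value) tokens."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # Attributes are ignored in this sketch.
        self.tokens.append(("tag", f"<{tag}>"))

    def handle_endtag(self, tag):
        self.tokens.append(("tag", f"</{tag}>"))

    def handle_data(self, data):
        text = data.strip()
        if text:  # skip whitespace-only text between tags
            self.tokens.append(("text", text))


def tokenize(html):
    parser = Tokenizer()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return parser.tokens


def match(wrapper, sample):
    """Refine `wrapper` against `sample`; return None if a mismatch is unsolved."""
    if len(wrapper) != len(sample):
        return None  # would need generalizations this sketch does not attempt
    refined = []
    for w, s in zip(wrapper, sample):
        if w == s:
            refined.append(w)  # tokens agree, nothing to solve
        elif w[0] == "text" and s[0] == "text":
            refined.append(("text", PCDATA))  # generalize differing text
        else:
            return None  # a mismatch involving a tag is left unsolved here
    return refined


def infer_wrapper(pages):
    """Use the first page as the initial wrapper, then refine it page by page."""
    samples = [tokenize(page) for page in pages]
    wrapper = samples[0]
    for sample in samples[1:]:
        wrapper = match(wrapper, sample)
        if wrapper is None:
            return None  # no common wrapper under this sketch's assumptions
    return wrapper


pages = [
    "<html><b>Name:</b> John</html>",
    "<html><b>Name:</b> Mary</html>",
]
print(infer_wrapper(pages))
```

For the two sample pages, the text fields "John" and "Mary" disagree, so that position of the wrapper becomes a `#PCDATA` field, while the shared label "Name:" and all tags are kept as they are.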