Initial Results on Wrapping Semistructured Web Pages with Finite-State Transducers and Contextual Rules

This paper presents SoftMealy, a novel Web wrapper representation formalism. The representation is based on a finite-state transducer (FST) and contextual rules, which together allow a wrapper to handle semistructured Web pages containing missing attributes, multiple attribute values, variant attribute permutations, exceptions, and typos, features that previous approaches cannot handle. A SoftMealy wrapper can be learned from labeled example items using a simple induction algorithm. Learnability analysis shows that SoftMealy scales well with the number of attributes and the number of different attribute permutations. Experimental results show that the learning algorithm can learn correct wrappers for a wide range of Web pages from a handful of examples, and that the learned wrappers generalize well to unseen pages and structural patterns.
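
To make the idea of an FST driven by contextual rules concrete, the following is a minimal illustrative sketch, not the SoftMealy implementation. It assumes a hypothetical two-attribute item (`name`, `title`), a single dummy state `skip` for text that is not extracted, and hand-written regular expressions standing in for contextual rules; in SoftMealy the rules are induced from labeled examples rather than written by hand. A transition fires at a position when its left context matches the text before that position and its right context matches the text after it.

```python
import re

DUMMY = "skip"  # hypothetical dummy state: text read here is not extracted

# Hand-written stand-ins for contextual rules (assumed for illustration only):
# (from_state, to_state) -> (left-context regex, right-context regex)
RULES = {
    (DUMMY, "name"): (r"<b>\Z", r""),
    (DUMMY, "title"): (r"<i>\Z", r""),
    ("name", DUMMY): (r"", r"</b>"),
    ("title", DUMMY): (r"", r"</i>"),
}


def fires(rule, text, pos):
    """Return True if the rule's left and right contexts both match at pos."""
    left, right = rule
    return (re.search(left, text[:pos]) is not None
            and re.match(right, text[pos:]) is not None)


def extract(text):
    """Run the transducer over text and return (attribute, value) pairs."""
    state, start, out = DUMMY, 0, []
    for pos in range(len(text) + 1):
        for (src, dst), rule in RULES.items():
            if src == state and fires(rule, text, pos):
                if state != DUMMY:
                    # Leaving an attribute state emits the text it covered.
                    out.append((state, text[start:pos]))
                state, start = dst, pos
                break  # at most one transition per position in this sketch
    return out


if __name__ == "__main__":
    print(extract("<b>Ann Wu</b>, <i>Professor</i>"))
    print(extract("<i>Lecturer</i> <b>Bob Lee</b>"))  # permuted attributes
    print(extract("<b>Cy Ng</b>"))                    # missing title
```

Because every attribute state is reachable from the dummy state and returns to it, the same small transducer accepts items with a missing attribute, attributes in a different order, or repeated values, which is the kind of flexibility the abstract describes. The sketch does not handle empty attribute values, and it does not attempt the induction of rules from examples.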