A Unifying Approach to HTML Wrapper Representation and Learning

The number, size, and dynamics of Internet information sources bear abundant evidence of the need for automation in information extraction. This calls for representation formalisms that match the reality of the World Wide Web, and for learning approaches and learnability results that apply to these formalisms. The concept of elementary formal systems is appropriately generalized to allow for the representation of wrapper classes that are relevant to describing Internet sources in HTML format. Related learning results prove that these wrappers are automatically learnable from examples. This sets the stage for information extraction from the Internet by exploiting inductive learning techniques.
