A Grammar Inference Algorithm for the World Wide Web

The World Wide Web is a treasure trove of information. The Web’s sheer scale makes automatic location and extraction of information appealing. However, much of the information lies buried in documents designed for human consumption, such as home pages or product catalogs. Before software agents can extract nuggets of information from Web documents, they have to be able to recognize it despite the multitude of formats in which it may appear. In this paper, we take a machine learning approach to the problem. We explain why existing grammar inference techniques face difficulties in this domain, present a new technique, and demonstrate its success on examples drawn from the Web ranging from CMU Tech Report codes to bus schedules. Our algorithm is shown to learn target languages found on the Web in significantly fewer examples than previous methods. In addition, our algorithm is guaranteed to learn in the limit, and runs in time O(|S|), where |S| is the size of the sample.
