Wrapper Induction for Information Extraction

Many Internet information resources present relational data (telephone directories, product catalogs, etc.). Because these sites are formatted for people, mechanically extracting their content is difficult. Systems using such resources typically use hand-coded wrappers, procedures to extract data from information resources. We introduce wrapper induction, a method for automatically constructing wrappers, and identify HLRT, a wrapper class that is efficiently learnable, yet expressive enough to handle 48% of a recently surveyed sample of Internet resources. We use PAC analysis to bound the problem's sample complexity, and show that the system degrades gracefully with imperfect labeling knowledge.
