Record-boundary discovery in Web documents

Extracting information from unstructured or semistructured Web documents often requires recognizing and delimiting records. (By “record” we mean a group of information relevant to some entity.) Unless a document that contains multiple records is first chunked along record boundaries, extraction of record information is unlikely to succeed. In this paper we describe a heuristic approach to discovering record boundaries in Web documents. In our approach, we capture the structure of a document as a tree of nested HTML tags, locate the subtree containing the records of interest, identify candidate separator tags within the subtree using five independent heuristics, and select a consensus separator tag based on a combined heuristic. Our approach is fast (it runs in linear time for practical cases within the context of the larger data-extraction problem) and accurate (100% in the experiments we conducted).
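
The pipeline described above (tag tree, record subtree, candidate separators, consensus) can be illustrated with a minimal Python sketch. It is not the implementation evaluated in this paper, and several choices are assumptions made only for illustration: the record subtree is taken to be the node with the most children, only two stand-in heuristics are shown rather than the five used in our approach, the `PREFERRED` tag list is hypothetical, and the consensus is a simple rank sum. All names (`Node`, `TagTreeBuilder`, `find_separator`, and so on) are likewise hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser

# Tags that never have a closing tag; they become leaves of the tree.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}

# Hypothetical list of tags that often separate records, most likely first.
PREFERRED = ["hr", "tr", "td", "a", "table", "p", "br", "h4", "h1", "b", "i"]


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []  # child nodes in document order


class TagTreeBuilder(HTMLParser):
    """Builds a tree of nested HTML tags, ignoring text and attributes."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.stack[-1])
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest open element with this name, if there is one.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break


def iter_nodes(node):
    yield node
    for child in node.children:
        yield from iter_nodes(child)


def highest_fanout_subtree(root):
    # Assumption: the records of interest sit under the node with most children.
    return max(iter_nodes(root), key=lambda n: len(n.children))


def rank_by_count(subtree):
    # Stand-in heuristic 1: tags that occur most often among the children.
    counts = Counter(child.tag for child in subtree.children)
    return [tag for tag, _ in counts.most_common()]


def rank_by_preferred_list(subtree):
    # Stand-in heuristic 2: order candidates by the hypothetical PREFERRED list.
    present = {child.tag for child in subtree.children}
    return [tag for tag in PREFERRED if tag in present]


def consensus_separator(subtree, heuristics):
    # Rank-sum vote: each heuristic gives more points to tags it ranks higher.
    scores = Counter()
    for heuristic in heuristics:
        ranking = heuristic(subtree)
        for position, tag in enumerate(ranking):
            scores[tag] += len(ranking) - position
    return max(scores, key=scores.get) if scores else None


def find_separator(html):
    builder = TagTreeBuilder()
    builder.feed(html)
    subtree = highest_fanout_subtree(builder.root)
    heuristics = [rank_by_count, rank_by_preferred_list]
    return subtree.tag, consensus_separator(subtree, heuristics)


SAMPLE = """<html><body><h1>Listings</h1>
<div>
<hr><b>Record one</b> details<br>
<hr><b>Record two</b> details<br>
<hr><b>Record three</b> details<br>
</div></body></html>"""

if __name__ == "__main__":
    print(find_separator(SAMPLE))
```

On the sample document the sketch picks the `div` element as the record subtree and `hr` as the separator tag. The rank-sum vote is only one simple way to reconcile heuristics that disagree; other combination schemes fit the same structure by replacing `consensus_separator`.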
