Ranking XPaths for extracting search result records

Extracting search result records (SRRs) from webpages is useful for building an aggregated search engine that combines results from a variety of search engines. Most automatic approaches to search result extraction are not portable: the complete process has to be rerun on each new search result page. In this paper we describe an algorithm that automatically determines XPath expressions for extracting SRRs from webpages. From a single search result page, the algorithm determines an XPath expression that can be reused to extract SRRs from other pages based on the same template. The algorithm is evaluated on six datasets, including two new datasets containing a variety of web, image, video, shopping and news search results. The evaluation shows that a useful XPath is determined for 85% of the tested search result pages. The algorithm is implemented as a browser plugin and as a standalone application, both of which are available as open source software.
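
To illustrate what reusing an XPath across pages of the same template could look like, the following is a minimal sketch and not the implementation described in this paper. The two pages, the expression `SRR_XPATH` and the helper `extract_srrs` are hypothetical; the expression is simply assumed to be the output of the ranking step, and the pages are assumed to be well-formed markup so that Python's standard `xml.etree.ElementTree` module, which supports only a subset of XPath, can parse them.

```python
import xml.etree.ElementTree as ET

# Hypothetical result pages that share one template (assumed well-formed).
PAGE_A = """<html><body>
<div id="nav"><a href="/home">Home</a></div>
<div id="results">
  <div class="result"><a href="http://example.com/1">First hit</a></div>
  <div class="result"><a href="http://example.com/2">Second hit</a></div>
</div>
</body></html>"""

PAGE_B = """<html><body>
<div id="nav"><a href="/home">Home</a></div>
<div id="results">
  <div class="result"><a href="http://example.com/3">Third hit</a></div>
</div>
</body></html>"""

# Assumption: this expression was determined once, from PAGE_A alone.
SRR_XPATH = ".//div[@id='results']/div[@class='result']"


def extract_srrs(page, xpath=SRR_XPATH):
    """Return (title, url) pairs for every node matched by the XPath."""
    root = ET.fromstring(page)
    records = []
    for node in root.findall(xpath):
        link = node.find("a")  # first link inside the record
        if link is not None:
            records.append((link.text, link.get("href")))
    return records


# The same expression is applied unchanged to a second page.
for page in (PAGE_A, PAGE_B):
    print(extract_srrs(page))
```

The navigation link is not matched on either page, because the expression selects only the repeated record nodes inside the result container. Real search result pages are rarely well-formed, so a tolerant HTML parser would be needed in practice.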
