Ranking XPaths for extracting search result records

Extracting search result records (SRRs) from webpages is useful for building an aggregated search engine that combines search results from a variety of search engines. Most automatic approaches to search result extraction are not portable: the complete process has to be rerun on each new search result page. In this paper we describe an algorithm to automatically determine XPath expressions to extract SRRs from webpages. Based on a single search result page, an XPath expression is determined which can be reused to extract SRRs from pages based on the same template. The algorithm is evaluated on six datasets, including two new datasets containing a variety of web, image, video, shopping and news search results. The evaluation shows that for 85% of the tested search result pages, a useful XPath is determined. The algorithm is implemented as a browser plugin and as a standalone application, both of which are available as open source software.
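To illustrate what reusing one XPath across pages of the same template might look like, here is a minimal Python sketch. It does not reproduce the paper's ranking algorithm. It assumes an XPath has already been determined from one page and only shows that expression being applied to a second page. The page markup, the expression in `SRR_XPATH`, and the function name `extract_records` are all hypothetical and chosen for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical XPath, assumed to have been determined from a single result page.
SRR_XPATH = ".//div[@class='result']"

# Two made-up, well-formed pages that share one template.
PAGE_A = """<html><body>
<div class="ad">Sponsored</div>
<div class="result"><a href="http://example.com/1">First</a></div>
<div class="result"><a href="http://example.com/2">Second</a></div>
</body></html>"""

PAGE_B = """<html><body>
<div class="result"><a href="http://example.com/3">Third</a></div>
<div class="ad">Sponsored</div>
</body></html>"""


def extract_records(page_source, xpath):
    """Return (title, url) pairs for every node matched by the XPath."""
    root = ET.fromstring(page_source)
    records = []
    for node in root.findall(xpath):
        link = node.find(".//a")
        if link is None:
            continue  # skip matched nodes that contain no link
        records.append((link.text, link.get("href")))
    return records


if __name__ == "__main__":
    # The same expression is applied to both pages without re-running any analysis.
    for page in (PAGE_A, PAGE_B):
        print(extract_records(page, SRR_XPATH))
```

The sketch uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath and requires well-formed markup. Real search result pages would likely need a tolerant HTML parser and a full XPath engine.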
