The parallel path framework for entity discovery on the web

It has long been a goal of the database and Web communities to reconcile the unstructured nature of the World Wide Web with the neat, structured schemas of the database paradigm. Although databases are already used to generate Web content on some sites, the schemas of these databases are rarely consistent across a domain, which makes it difficult to compare and aggregate information from different domains. We aim to take an important step towards resolving this disparity by using the structural and relational information on the Web to (1) extract Web lists, (2) find entity-pages, (3) map entity-pages to a database, and (4) extract attributes of the entities. Specifically, given a Web site and an example entity-page (e.g., a university department site and a faculty member's home page), we seek to find all entity-pages of the same type (e.g., all faculty members in the department), as well as attributes of the specific entities (e.g., their phone numbers, email addresses, and office numbers). To do this, we propose a Web structure mining method that grows parallel paths through the Web graph and DOM trees and propagates relevant attribute information forward. We show that these parallel paths allow entity-pages and attributes to be discovered efficiently. Finally, we demonstrate the accuracy of our method with a large case study.
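
To make the idea of parallel paths concrete, the following is a minimal sketch and not the method evaluated in the paper. It assumes a toy site in which every link is annotated with the DOM tag path of its anchor element. The names `SITE`, `find_dom_path_sequence`, and `grow_parallel_paths` are hypothetical, and attribute propagation is omitted. The sketch finds the sequence of DOM tag paths leading from the site root to the example entity-page, then follows every link that shares the same tag path at each step.

```python
from collections import deque

# Hypothetical toy site: page URL -> list of (DOM tag path of the anchor, target URL).
SITE = {
    "/": [("html/body/nav/a", "/people"), ("html/body/nav/a", "/news")],
    "/people": [
        ("html/body/ul/li/a", "/people/alice"),
        ("html/body/ul/li/a", "/people/bob"),
        ("html/body/ul/li/a", "/people/carol"),
        ("html/body/footer/a", "/contact"),
    ],
    "/news": [("html/body/div/a", "/news/1")],
}


def find_dom_path_sequence(site, root, example):
    """Breadth-first search from root to the example page.

    Returns the DOM tag paths of the links traversed, or None if unreachable.
    """
    queue = deque([(root, [])])
    seen = {root}
    while queue:
        page, paths = queue.popleft()
        if page == example:
            return paths
        for dom_path, target in site.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, paths + [dom_path]))
    return None


def grow_parallel_paths(site, root, example):
    """Return pages reached by paths parallel to the root -> example path.

    Two paths count as parallel here (an assumption of this sketch) when the
    links at each step sit at the same DOM tag path.
    """
    dom_paths = find_dom_path_sequence(site, root, example)
    if dom_paths is None:
        return []
    frontier = [root]
    for dom_path in dom_paths:
        next_frontier = []
        for page in frontier:
            for link_path, target in site.get(page, []):
                if link_path == dom_path and target not in next_frontier:
                    next_frontier.append(target)
        frontier = next_frontier
    return frontier


candidates = grow_parallel_paths(SITE, "/", "/people/alice")
print(candidates)
```

On this toy site the candidates are the three pages under `/people` that are linked from the same list structure as the example, while the footer link and the news branch are dropped because their tag paths differ. A real site would need tolerance for small structural variations, which this sketch does not attempt.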
