Paper information - Exploring structure and content on the web: extraction and integration of the semi-structured web

In this tutorial we view the World Wide Web as a massive, decentralized database. At present, this "Web database" is presented in a manner largely devoid of any consistent meaning or schema. That is not to say that Web data lacks an underlying organization; in fact, most Web content is generated from an underlying schema-bound or otherwise structured database. Information extraction is generally concerned with reconciling unstructured or semi-structured Web content with the neatly structured database paradigm. With this Web database in hand, researchers and practitioners have recently begun developing mechanisms that return structured results in response to an unstructured query. These new developments are a product of (1) record, list, and table extraction from large numbers of semi-structured Web pages, (2) integration of these disparate extraction results into a consistent form, and (3) analysis of the newly extracted and integrated Web data. Among the many fruits of this line of work is the ability of semi-structured Web data to enhance the search capabilities of a schema-bound database. Conversely, structured database records have been used to augment the Web page collections typically used by Web search engines. We will cover several key technologies and principles explored so far in the area of Web information extraction, search, and exploration.

Jiawei Han | Tim Weninger
