Towards a framework for attribute retrieval

In this paper, we propose an attribute retrieval approach which extracts and ranks attributes from HTML tables. We distinguish between class attribute retrieval and instance attribute retrieval. On the one hand, given an instance (e.g. University of Strathclyde), we retrieve its attributes from the Web (e.g. principal, location, number of students). On the other hand, given a class (e.g. universities) represented by a set of instances, we retrieve the attributes common to its instances. Furthermore, we show that instance attribute retrieval can be reinforced when similar instances are available. Our approach uses HTML tables, which are probably the largest source for attribute retrieval. Three recall-oriented filters are applied to each table to check the following properties: (i) whether the table is relational, (ii) whether the table has a header, and (iii) whether its attributes and values conform to each other. Candidate attributes are extracted from the tables that pass these filters and ranked with a combination of relevance features. Our approach is shown to achieve high recall and reasonable precision. Moreover, it outperforms state-of-the-art techniques.
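
The pipeline described above (filter tables, extract candidate attributes, rank them) might be sketched as follows. This is only an illustrative sketch, not the method evaluated in the paper: the function names, the heuristics inside each filter, the length thresholds, and the toy tables are all assumptions, and plain frequency counting stands in for the combination of relevance features.

```python
from collections import Counter

def is_relational(table):
    # Assumed heuristic: at least 2 rows and 2 columns, all rows equally wide.
    return (len(table) >= 2 and len(table[0]) >= 2
            and all(len(row) == len(table[0]) for row in table))

def has_header(table):
    # Assumed heuristic: every cell of the first row is non-empty and non-numeric.
    return all(c.strip() and not c.strip().replace(".", "", 1).isdigit()
               for c in table[0])

def conforms(table, max_attr_len=30, max_value_len=80):
    # Assumed heuristic: attribute names and values are short strings.
    header, body = table[0], table[1:]
    return (all(len(c) <= max_attr_len for c in header)
            and all(len(c) <= max_value_len for row in body for c in row))

def candidate_attributes(tables, instance):
    """Count header cells of filtered tables that have a row naming the instance."""
    counts = Counter()
    for table in tables:
        if not (is_relational(table) and has_header(table) and conforms(table)):
            continue
        for row in table[1:]:
            cols = [i for i, c in enumerate(row) if instance.lower() in c.lower()]
            if cols:
                # Header cells of the other columns are candidate attributes.
                counts.update(h.strip().lower()
                              for i, h in enumerate(table[0]) if i not in cols)
                break
    return counts

def rank(counts):
    # Highest count first; ties broken alphabetically for determinism.
    return sorted(counts, key=lambda a: (-counts[a], a))

def rank_instance_attributes(tables, instance):
    return rank(candidate_attributes(tables, instance))

def rank_class_attributes(tables, instances):
    # Score an attribute by the number of class instances it was found for.
    score = Counter()
    for inst in instances:
        score.update(set(candidate_attributes(tables, inst)))
    return rank(score)

# Toy tables with placeholder values (not real data).
TABLES = [
    [["University", "Principal", "Location"],
     ["University of Strathclyde", "value A", "Glasgow"],
     ["University of Glasgow", "value B", "Glasgow"]],
    [["Name", "Location", "Students"],
     ["University of Strathclyde", "Glasgow", "12345"]],
    [["Menu"], ["Home"]],  # layout table, rejected by is_relational
]

if __name__ == "__main__":
    print(rank_instance_attributes(TABLES, "University of Strathclyde"))
    print(rank_class_attributes(
        TABLES, ["University of Strathclyde", "University of Glasgow"]))
```

In this sketch the class-level score simply counts how many instances share an attribute, which mirrors the idea that similar instances can reinforce each other; a fuller implementation would presumably replace both counts with learned or weighted relevance features.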
