Column Concept Determination for Chinese Web Tables via Convolutional Neural Network

Hundreds of millions of tables on the Internet contain a wealth of high-quality relational data. However, web tables tend to lack explicit key semantic information, so information extraction from tables usually has to be supplemented by recovering table semantics, in which column concept determination is an important task. In this paper, we focus on column concept determination in Chinese web tables. Unlike previous work, we apply a convolutional neural network (CNN) to this task. Our contributions are threefold: first, we construct datasets automatically from the infoboxes in Baidu Encyclopedia; second, to determine column concepts, we train a CNN classifier to annotate the cells in a table and apply majority voting over each column to exclude incorrect annotations; third, to verify its effectiveness, we evaluate the method on a real tabular dataset. Experimental results show that the proposed method outperforms the baseline methods and achieves an average accuracy of 97% for column concept determination.
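
The majority-vote step over a column's cell annotations can be illustrated with a minimal sketch. It assumes that some cell classifier (the CNN in the paper, not reproduced here) has already produced one concept label per cell, with None for cells it could not annotate. The function name `column_concept` and the sample labels are hypothetical and chosen only for illustration; the abstract does not specify how ties or unlabeled cells are handled.

```python
from collections import Counter


def column_concept(cell_labels):
    """Pick a column's concept by majority vote over its cell labels.

    cell_labels: iterable of per-cell concept labels (strings), where
    None marks a cell the classifier could not annotate (an assumption
    of this sketch, not something stated in the abstract).
    """
    votes = Counter(label for label in cell_labels if label is not None)
    if not votes:
        return None  # no usable annotations in this column
    label, _count = votes.most_common(1)[0]
    return label


# Hypothetical per-cell predictions for a single column.
predicted = ["city", "city", "person", "city", None]
print(column_concept(predicted))
```

Voting at the column level means a few misclassified cells do not change the column's concept as long as most cells in the column are annotated correctly.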
