Sato: Contextual Semantic Type Detection in Tables

Detecting the semantic types of data columns in relational tables is important for various data preparation and information retrieval tasks such as data cleaning, schema matching, data discovery, and semantic search. However, existing detection approaches either perform poorly with dirty data, support only a limited number of semantic types, fail to incorporate the table context of columns or rely on large sample sizes for training data. We introduce Sato, a hybrid machine learning model to automatically detect the semantic types of columns in tables, exploiting the signals from the context as well as the column values. Sato combines a deep learning model trained on a large-scale table corpus with topic modeling and structured prediction to achieve support-weighted and macro average F1 scores of 0.925 and 0.735, respectively, exceeding the state-of-the-art performance by a significant margin. We extensively analyze the overall and per-type performance of Sato, discussing how individual modeling components, as well as feature categories, contribute to its performance.

[1]  Daisy Zhe Wang,et al.  WebTables: exploring the power of tables on the web , 2008, Proc. VLDB Endow..

[2]  Nan Ye,et al.  Conditional random field with high-order dependencies for sequence labeling and segmentation , 2014, J. Mach. Learn. Res..

[3]  Omer Levy,et al.  RoBERTa: A Robustly Optimized BERT Pretraining Approach , 2019, ArXiv.

[4]  Noah A. Smith Linguistic Structure Prediction , 2011, Synthesis Lectures on Human Language Technologies.

[5]  Chris Clifton,et al.  Semantic Integration in Heterogeneous Databases Using Neural Networks , 1994, VLDB.

[6]  Erhard Rahm,et al.  A survey of approaches to automatic schema matching , 2001, The VLDB Journal.

[7]  Joseph M. Hellerstein,et al.  Potter's Wheel: An Interactive Data Cleaning System , 2001, VLDB.

[8]  Jens Lehmann,et al.  DBpedia: A Nucleus for a Web of Open Data , 2007, ISWC/ASWC.

[9]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[10]  Genifer Snipes,et al.  Google Data Studio , 2018 .

[11]  Stuart Geman,et al.  Markov Random Field Image Models and Their Applications to Computer Vision , 2010 .

[12]  Geoffrey E. Hinton,et al.  Learning representations by back-propagating errors , 1986, Nature.

[13]  Petr Sojka,et al.  Software Framework for Topic Modelling with Large Corpora , 2010 .

[14]  Pietro Perona,et al.  A Bayesian hierarchical model for learning natural scene categories , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[15]  Sebastian Nowozin,et al.  Structured Learning and Prediction in Computer Vision , 2011, Found. Trends Comput. Graph. Vis..

[16]  Xing Xie,et al.  Discovering regions of different functions in a city using human mobility and POIs , 2012, KDD.

[17]  Andrew McCallum,et al.  Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data , 2001, ICML.

[18]  Peter Norvig,et al.  The Unreasonable Effectiveness of Data , 2009, IEEE Intelligent Systems.

[19]  Geoffrey E. Hinton,et al.  Visualizing Data using t-SNE , 2008 .

[20]  P. Cochat,et al.  Et al , 2008, Archives de pediatrie : organe officiel de la Societe francaise de pediatrie.

[21]  Nikhil Puranik A specialist approach for the classification of column data , 2012 .

[22]  Luca Antiga,et al.  Automatic differentiation in PyTorch , 2017 .

[23]  Craig A. Knoblock,et al.  Semantic Labeling: A Domain-Independent Approach , 2016, SEMWEB.

[24]  Rob Fergus,et al.  Visualizing and Understanding Convolutional Networks , 2013, ECCV.

[25]  Sunita Sarawagi,et al.  Annotating and searching web tables using entities, types and relationships , 2010, Proc. VLDB Endow..

[26]  Yang Liu,et al.  Fine-tune BERT for Extractive Summarization , 2019, ArXiv.

[27]  Guigang Zhang,et al.  Deep Learning , 2016, Int. J. Semantic Comput..

[28]  David M. Blei,et al.  Probabilistic topic models , 2012, Commun. ACM.

[29]  Michael I. Jordan,et al.  Latent Dirichlet Allocation , 2001, J. Mach. Learn. Res..

[30]  Lawrence R. Rabiner,et al.  A tutorial on hidden Markov models and selected applications in speech recognition , 1989, Proc. IEEE.

[31]  Michael Stonebraker,et al.  Seeping Semantics: Linking Datasets Using Word Embeddings for Data Discovery , 2018, 2018 IEEE 34th International Conference on Data Engineering (ICDE).

[32]  Gökhan BakIr,et al.  Predicting Structured Data , 2008 .

[33]  Jeffrey Heer,et al.  Wrangler: interactive visual specification of data transformation scripts , 2011, CHI.

[34]  Tim Kraska,et al.  Sherlock: A Deep Learning Approach to Semantic Data Type Detection , 2019, KDD.

[35]  François Yvon,et al.  Learning the Structure of Variable-Order CRFs: a finite-state perspective , 2017, EMNLP.

[36]  Cong Yan,et al.  Synthesizing Type-Detection Logic for Rich Semantic Data Types using Open-source Code , 2018, SIGMOD Conference.

[37]  Michael I. Jordan,et al.  Modeling annotated data , 2003, SIGIR.

[38]  Michael Stonebraker,et al.  Aurum: A Data Discovery System , 2018, 2018 IEEE 34th International Conference on Data Engineering (ICDE).

[39]  Thomas Lengauer,et al.  Permutation importance: a corrected feature importance measure , 2010, Bioinform..

[40]  Jayant Madhavan,et al.  Recovering Semantics of Tables on the Web , 2011, Proc. VLDB Endow..

[41]  Exploiting Structure within Data for Accurate Labeling using Conditional Random Fields , 2012 .

[42]  Craig A. Knoblock,et al.  Assigning Semantic Labels to Data Sources , 2015, ESWC.

[43]  Praveen Paritosh,et al.  Freebase: a collaboratively created graph database for structuring human knowledge , 2008, SIGMOD Conference.

[44]  Tim Finin,et al.  Exploiting a Web of Semantic Data for Interpreting Tables , 2010 .

[45]  Shinji Nakadai,et al.  Meimei: An Efficient Probabilistic Approach for Semantically Annotating Tables , 2019, AAAI.

[46]  Pedro M. Domingos,et al.  Reconciling schemas of disparate data sources: a machine-learning approach , 2001, SIGMOD '01.

[47]  Ming-Wei Chang,et al.  BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , 2019, NAACL.

[48]  Xuanjing Huang,et al.  Sparse higher order conditional random fields for improved sequence labeling , 2009, ICML '09.

[49]  Andrew J. Viterbi,et al.  Error bounds for convolutional codes and an asymptotically optimum decoding algorithm , 1967, IEEE Trans. Inf. Theory.

[50]  Pedro M. Domingos,et al.  Learning to Match the Schemas of Data Sources: A Multistrategy Approach , 2003, Machine Learning.

[51]  Tim Kraska,et al.  VizNet: Towards A Large-Scale Visualization Learning and Benchmarking Repository , 2019, CHI.

[52]  Nir Friedman,et al.  Probabilistic Graphical Models - Principles and Techniques , 2009 .