Paper Information - Entity Matching from Unstructured and Dissimilar Data Collections: Semantic and Content Distribution Approach

This paper describes a solution to the problem of extracting data features from a collection of dissimilar, unstructured data sets gathered from multiple data sources on the web or in databases. We present a method of feature extraction and normalization that aims to close the gap between a workable data set of uniform content and a large, unstructured, un-normalized collection of unworkable data. The feature extraction we model produces focused, structured data sets as output, designed from a Big Data and analytics perspective. The solution automates data ingestion from public data sources and applies machine learning methodology to build data relationships across unstructured data sets. Our research aims to extract key features using a semi-supervised process, semantic relations, and statistical analysis of the distribution of content. The mapping across dissimilar data sets is treated as a matching problem over these metrics, constructing a scoring value that maps different entities to one another. We propose a three-layer matching process, based on pattern recognition, for homogeneous covariates from different sources whose semantics and measures are nonstandard. This work presents a novel way to tackle the entity resolution problem. The results show that the method works well on real industrial data and provides immediate ROI value for the data management system.
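
The abstract describes a score that combines semantic relations with the statistical distribution of content, but it gives no formula. As a rough illustration only, the sketch below blends two signals for a pair of columns from different sources: token overlap between the column names (a crude stand-in for semantic relatedness) and histogram intersection of the values (a stand-in for content distribution). The function names, the equal weighting, and the choice of both measures are assumptions made for this example, not the authors' method, and the sketch does not reproduce their three-layer process.

```python
import re


def name_tokens(name):
    """Split a column name into lowercase alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", name.lower()))


def name_similarity(a, b):
    """Jaccard overlap of name tokens (hypothetical proxy for semantic similarity)."""
    ta, tb = name_tokens(a), name_tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def distribution_similarity(xs, ys, bins=10):
    """Histogram intersection of two numeric samples over their shared range."""
    if not xs or not ys:
        return 0.0
    lo = min(min(xs), min(ys))
    hi = max(max(xs), max(ys))
    if hi == lo:
        return 1.0  # every value is identical in both samples

    def hist(vals):
        counts = [0] * bins
        for v in vals:
            # Clamp so the maximum value falls into the last bin.
            i = min(int((v - lo) / (hi - lo) * bins), bins - 1)
            counts[i] += 1
        return [c / len(vals) for c in counts]

    return sum(min(p, q) for p, q in zip(hist(xs), hist(ys)))


def match_score(name_a, values_a, name_b, values_b, w_name=0.5):
    """Weighted blend of both signals; w_name is an assumed, untuned weight."""
    return (w_name * name_similarity(name_a, name_b)
            + (1 - w_name) * distribution_similarity(values_a, values_b))


if __name__ == "__main__":
    pressure_a = [101.2, 99.8, 100.5, 102.1, 98.9]
    pressure_b = [100.1, 101.7, 99.2, 100.9, 101.3]
    rpm = [1500.0, 1480.0, 1520.0, 1495.0, 1510.0]
    print(match_score("pump_pressure_psi", pressure_a, "pressure_pump", pressure_b))
    print(match_score("pump_pressure_psi", pressure_a, "motor_rpm", rpm))
```

The sketch assumes both columns are numeric and already expressed in comparable units. Handling the nonstandard measures mentioned in the abstract would need a normalization step that is not shown here.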

Ravigopal Vennelakanti | Marnith Peng | Jose Luis Beltran
