Business data fusion

Enterprise business intelligence usually relies on data from multiple sources that is carefully joined on common attributes and consolidated into a single data warehouse. This process is error-prone because join attributes are often hard to resolve across sources, and for diverse external data sources an accurate join may be impossible. Nevertheless, each such source can still provide useful information on correlations among the attributes it captures, and enterprises are increasingly looking to replace the traditional data warehouse with 'data lakes' based on newer technology, such as Hadoop, in order to derive statistical insights. We describe an approach to 'business data fusion' applicable in such a scenario. We define 'distributional queries', describe their utility in multiple scenarios, including correlating diverse data sources, and show that they are equivalent to probabilistic inference. To execute such queries efficiently, relationships and correlations across data sources are summarized in a Bayesian network, which is learned in an expert-guided manner so as to incorporate domain knowledge. We present empirical results for two applications of our approach: (a) summarizing large volumes of vehicular multi-sensor data in a sensor data lake, so that probabilistic answers supporting engineering analysis can be provided efficiently without repeatedly accessing the raw data; and (b) demonstrating how potentially diverse and unrelated public and private data sources can nevertheless be joined approximately and efficiently to derive useful statistical insights via distributional queries implemented using Bayesian inference.
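To make the idea of a distributional query concrete, the following is a minimal sketch and not the authors' implementation. It assumes two hypothetical toy sources that share only a `region` attribute and have no record-level join key, and it assumes a three-node network `income <- region -> product`, so that income and product are conditionally independent given region. All data values, attribute names, and function names are illustrative assumptions. Under these assumptions the query P(product | income) reduces to a sum over the shared attribute.

```python
from collections import Counter, defaultdict

# Hypothetical toy sources (assumed for illustration); no key links their records.
source_a = [  # (region, income_band)
    ("north", "high"), ("north", "high"), ("north", "low"),
    ("south", "low"), ("south", "low"), ("south", "high"),
]
source_b = [  # (region, product)
    ("north", "suv"), ("north", "suv"), ("north", "suv"), ("north", "hatchback"),
    ("south", "hatchback"), ("south", "hatchback"), ("south", "hatchback"), ("south", "suv"),
]


def conditional(pairs):
    """Estimate P(second | first) from (first, second) pairs by counting."""
    counts = defaultdict(Counter)
    for first, second in pairs:
        counts[first][second] += 1
    return {
        first: {second: n / sum(c.values()) for second, n in c.items()}
        for first, c in counts.items()
    }


def marginal(values):
    """Estimate P(value) by counting."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}


def distributional_query(income_band):
    """Return P(product | income_band), assuming income <- region -> product."""
    # Assumption: the region prior is taken from source A alone.
    p_region = marginal(region for region, _ in source_a)
    p_income_given_region = conditional(source_a)
    p_product_given_region = conditional(source_b)

    # Bayes' rule: P(region | income) is proportional to P(income | region) * P(region).
    weights = {
        region: p_income_given_region[region].get(income_band, 0.0) * prior
        for region, prior in p_region.items()
    }
    evidence = sum(weights.values())
    if evidence == 0:
        raise ValueError(f"income band {income_band!r} never observed")
    p_region_given_income = {r: w / evidence for r, w in weights.items()}

    # Marginalize out the shared attribute to "join" the two sources approximately.
    result = defaultdict(float)
    for region, p_r in p_region_given_income.items():
        for product, p_p in p_product_given_region.get(region, {}).items():
            result[product] += p_r * p_p
    return dict(result)


if __name__ == "__main__":
    print(distributional_query("high"))
```

The answer is a distribution over products rather than a set of joined rows, which is why no record-level key is needed. Its accuracy depends entirely on how well the assumed conditional independence holds in the real data.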
