Methodologies for Cross-Domain Data Fusion: An Overview

Traditional data mining usually deals with data from a single domain. In the big data era, we face a diversity of datasets from different sources in different domains. These datasets consist of multiple modalities, each of which has a different representation, distribution, scale, and density. Unlocking the power of knowledge from multiple disparate (but potentially connected) datasets is paramount in big data research and essentially distinguishes big data from traditional data mining tasks. This calls for advanced techniques that can fuse knowledge from various datasets organically in a machine learning and data mining task. This paper summarizes the data fusion methodologies, classifying them into three categories: stage-based, feature level-based, and semantic meaning-based data fusion methods. The last category is further divided into four groups: multi-view learning-based, similarity-based, probabilistic dependency-based, and transfer learning-based methods. These methods focus on knowledge fusion rather than schema mapping and data merging, which significantly distinguishes cross-domain data fusion from the traditional data fusion studied in the database community. This paper not only introduces the high-level principles of each category of methods, but also gives examples in which these techniques are used to handle real big data problems. In addition, this paper positions existing works in a framework, exploring the relationships and differences between different data fusion methods. This paper will help a wide range of communities find a solution for data fusion in big data projects.
